2.2. Regular Expressions

In this section, we will describe the regular languages in another way. First, we define the syntax and the semantics of regular expressions, and then we show that they describe the same class of languages as the class that can be generated by regular grammars.

Definition 13. (Regular expressions). Let T be an alphabet and V = T ∪ {∅, λ, +, ·, *,(,)} be defined as its extension, such that T ∩ {∅, λ, +, ·, *,(,)} = ∅. Then, we define the regular expressions over T as expressions over the alphabet V in an inductive way:

Further, every regular expression can be obtained from the basic regular expressions using finite number of induction steps.

Definition 14 (Languages described by regular expressions). Let T be an alphabet. Then, we define languages described by the regular expressions over the alphabet T following their inductive definition.

The language operations used in the above definition are the regular operations:

Two regular expressions are equivalent if they describe the same language. Here are some examples:

(p+r)(r+p)(commutativity of union)
((p+r)+q)(p+(r+q))(associativity of union)
(r+∅) r(additive zero element, unit element for union)
((pr)q) (p(rq)(associativity of concatenation)
(rλ) r) ≡ r(unit element for concatenation)
((p+r)q) ((pq)+(rq))(left distributivity)
(p(r+q)) ((pr)+(pq))(right distributivity)
(r∅)(zero element for concatenation)
λ *λ(iteration of the unit element)
*λ(iteration of the zero element)
(rr*)(r*r)(positive iteration)

Actually, the values of the last equivalence are frequently used, and they are denoted by r+, i.e., r+rr* by definition. This type of iteration which does not allow for 0 times iteration, i.e., only positive numbers of iterations are allowed, is usually called Kleene-plus iteration.

With the use of the above equivalences we can write most of the regular expressions in a shorter form: some of the brackets can be eliminated without causing any ambiguity in the language described. The elimination of the brackets is usually based on the associativity (both for the union and for the concatenation). Some further brackets can be eliminated by fixing a precedence order among the regular operations: the unary operation (Kleene-)star is the strongest, then the concatenation (the product), and (as usual) the union (the addition) is the weakest.

We have regular grammars to generate languages and regular expressions to describe languages, but these concepts are not independent. First we will prove one direction of the equivalence between them.

Theorem 3. Every language described by a regular expression can be generated by a regular grammar.

Proof. The proof goes by induction based on the definition of regular expressions. Let r be a basic regular expression, then

If r is not a basic regular expression then the following cases may occur:

Since every regular expression built by finitely many applications of the induction step, for any regular expression one can construct a regular grammar such that the grammar generates the language described by the expression.

QED.

When we want to have a grammar that generates the language given by a regular expression, the process can be faster, if we know that every singleton, i.e., language containing only one word, can easily be obtained by a grammar having only one nonterminal (the start symbol) and only one production that allows to generate the given word in one step.

Example 20. Let r = (cbb)*(ab+ba) be a regular expression (that describes a language over the alphabet {a,b,c}). Give a regular grammar that generates this language.

Solution:

Let us build up r from its subexpressions. According to the above observation, the language {cbb} can be generated by the grammar

Gcbb =({Scbb},{a,b,c},Scbb,{Scbbcbb}).

Now, let us use our induction step to obtain the grammar G(cbb)* that generates the language (cbb)*: then

G(cbb)* = ({Scbb, S(cbb)*},{a,b,c},S(cbb)*,

{Scbbccb, S(cbb)* → λ, S(cbb)*Scbb, S(cbb)cbb S(cbb)}).

The languages {ab} and {ba} can be generated by grammars

Gab = ({Sab},{a,b,c},Sab,{Sabab})

and

Gab = ({Sba},{a,b,c},Sba,{Sbaba}),

respectively. Their union, ab+ba, then is generated by the grammar

Gab+ba = ({Sab,Sba,Sab+ba},{a,b,c},Sab+ba,

{Sabab, Sbaba, Sab+baSab, Sab+baSba})

according to our induction step.

Finally, we need the concatenation of the previous expressions (cbb)* and (ab+ba), and it is generated by the grammar

G(cbb*)(ab+ba) = ({Scbb, S(cbb)*,Sab,Sba,Sab+ba}, {a,b,c}, S(cbb)*,

{ScbbcbbSab+ba, S(cbb)*Sab+ba, S(cbb)*Scbb, S(cbb)*cbb S(cbb)*,

Sabab, Sbaba, Sab+baSab, Sab+baSba})

due to our induction step. The problem is solved.

Exercise 12. Give a regular expression that describes the language containing exactly those words that contain three consecutive a's over the alphabet {a,b}.

Exercise 13. Give a regular expression that describes the language containing exactly those words that do not contain two consecutive a's (over the alphabet {a,b}).

Exercise 14. Give a regular grammar that generates the language 0*(1+22)(2*+00).

Exercise 15. Give a regular grammar that generates the language 0+1(1+0)*.

Exercise 16. Give a regular grammar that generates the language (a+bb(b+(cc)*))*(ababa+c*).