## 2.2. Regular Expressions

In this section, we will describe the regular languages in another way. First, we define the syntax and the semantics of regular expressions, and then we show that they describe the same class of languages as the class that can be generated by regular grammars.

Definition 13. (Regular expressions). Let T be an alphabet and V = T ∪ {∅, λ, +, ·, *,(,)} be defined as its extension, such that T ∩ {∅, λ, +, ·, *,(,)} = ∅. Then, we define the regular expressions over T as expressions over the alphabet V in an inductive way:

• Base of the induction:

• is a (basic) regular expression,

• λ is a (basic) regular expression,

• every a T is a (basic) regular expression;

• Induction steps

• if p and r are regular expressions, then (p+r) is also a regular expression,

• if p and r are regular expressions, then (p · r) is also a regular expression (we usually eliminate the sign · and write (pr) instead (p · r)),

• if r is a regular expression, then r* is also a regular expression.

Further, every regular expression can be obtained from the basic regular expressions using finite number of induction steps.

Definition 14 (Languages described by regular expressions). Let T be an alphabet. Then, we define languages described by the regular expressions over the alphabet T following their inductive definition.

• Basic languages:

• refers to the empty language {},

• λ refers to the language {λ},

• for every aT, the expression refers to the language {a};

• Induction steps:

• if p and r refer to the languages Lp and Lr, respectively, then the regular expressions (p+r) refers to the language Lp Lr,

• if p and r refer to the languages Lp and Lr, respectively, then the regular expressions (p · r) or (pr) refers to the language Lp · Lr,

• if r refers to a language Lr then r* refers to the language Lr*.

The language operations used in the above definition are the regular operations:

• the addition (+) is the union of two languages,

• the product is the concatenation of two languages, and

• the reflexive and transitive closure of concatenation is the (Kleene-star) iteration of languages.

Two regular expressions are equivalent if they describe the same language. Here are some examples:

 (p+r) ≡ (r+p) (commutativity of union) ((p+r)+q) ≡ (p+(r+q)) (associativity of union) (r+∅) ≡ r (additive zero element, unit element for union) ((pr)q) ≡ (p(rq) (associativity of concatenation) (rλ) ≡ (λr) ≡ r (unit element for concatenation) ((p+r)q) ≡ ((pq)+(rq)) (left distributivity) (p(r+q)) ≡ ((pr)+(pq)) (right distributivity) (r∅) ≡ ∅ (zero element for concatenation) λ * ≡ λ (iteration of the unit element) ∅ * ≡ λ (iteration of the zero element) (rr*) ≡ (r*r) (positive iteration)

Actually, the values of the last equivalence are frequently used, and they are denoted by r+, i.e., r+rr* by definition. This type of iteration which does not allow for 0 times iteration, i.e., only positive numbers of iterations are allowed, is usually called Kleene-plus iteration.

With the use of the above equivalences we can write most of the regular expressions in a shorter form: some of the brackets can be eliminated without causing any ambiguity in the language described. The elimination of the brackets is usually based on the associativity (both for the union and for the concatenation). Some further brackets can be eliminated by fixing a precedence order among the regular operations: the unary operation (Kleene-)star is the strongest, then the concatenation (the product), and (as usual) the union (the addition) is the weakest.

We have regular grammars to generate languages and regular expressions to describe languages, but these concepts are not independent. First we will prove one direction of the equivalence between them.

Theorem 3. Every language described by a regular expression can be generated by a regular grammar.

Proof. The proof goes by induction based on the definition of regular expressions. Let r be a basic regular expression, then

• if r is ∅, then the empty language can be generated by the regular grammar

(S, A, T, S, {A → a});

• if r is λ, then the language {λ} can be generated by the regular grammar

(S, T, S, {S → λ});

• if r is a for a terminal aT, then the language {a} is generated by the regular grammar

(S, T, S, {S → a}).

If r is not a basic regular expression then the following cases may occur:

• r is (p+q) with some regular expressions p,q such that the regular grammars Gp = (Np, T, Sp, Pp) and Gq = (Nq, T, Sq, Pq) generate the languages described by expression p and q, respectively, where NpNq = ∅ (this can be done by renaming nonterminals of a grammar without affecting the generated language). Then let

G = (NpNq ∪ {S}, T, S, PpPq ∪ {SSp, S→ Sq}),

where SNpNq is a new symbol. It can be seen that G generates the language described by expression r.

• r is (p · q) with some regular expressions p,q such that the regular grammars Gp = (Np, T, Sp, Pp) and Gq = (Nq, T, Sq, Pq) generate the languages described by expression p and q, respectively, where NpNq = ∅. Then let

G = (NpNq,T, Sp, Pq ∪ {AuBAuBPp, A,BNp, uT*} ∪ {AuSqAuPp, ANp, uT*}).

It can be shown that G generates the language described by expression r.

• r is a regular expression of the form q* for a regular expression q. Further let Gq = (Nq, T, Sq, Pq) be a regular grammar that generates the language described by expression q. Then, let

G = (Nq ∪ {S},T,S, Pq ∪ {S → λ, SSq} ∪ {AuSqAuPq, ANq, uT*}),

where SNq. It can be shown that G generates the language described by expression r.

Since every regular expression built by finitely many applications of the induction step, for any regular expression one can construct a regular grammar such that the grammar generates the language described by the expression.

QED.

When we want to have a grammar that generates the language given by a regular expression, the process can be faster, if we know that every singleton, i.e., language containing only one word, can easily be obtained by a grammar having only one nonterminal (the start symbol) and only one production that allows to generate the given word in one step.

Example 20. Let r = (cbb)*(ab+ba) be a regular expression (that describes a language over the alphabet {a,b,c}). Give a regular grammar that generates this language.

Solution:

Let us build up r from its subexpressions. According to the above observation, the language {cbb} can be generated by the grammar

Gcbb =({Scbb},{a,b,c},Scbb,{Scbbcbb}).

Now, let us use our induction step to obtain the grammar G(cbb)* that generates the language (cbb)*: then

G(cbb)* = ({Scbb, S(cbb)*},{a,b,c},S(cbb)*,

{Scbbccb, S(cbb)* → λ, S(cbb)*Scbb, S(cbb)cbb S(cbb)}).

The languages {ab} and {ba} can be generated by grammars

Gab = ({Sab},{a,b,c},Sab,{Sabab})

and

Gab = ({Sba},{a,b,c},Sba,{Sbaba}),

respectively. Their union, ab+ba, then is generated by the grammar

Gab+ba = ({Sab,Sba,Sab+ba},{a,b,c},Sab+ba,

{Sabab, Sbaba, Sab+baSab, Sab+baSba})

according to our induction step.

Finally, we need the concatenation of the previous expressions (cbb)* and (ab+ba), and it is generated by the grammar

G(cbb*)(ab+ba) = ({Scbb, S(cbb)*,Sab,Sba,Sab+ba}, {a,b,c}, S(cbb)*,

{ScbbcbbSab+ba, S(cbb)*Sab+ba, S(cbb)*Scbb, S(cbb)*cbb S(cbb)*,

Sabab, Sbaba, Sab+baSab, Sab+baSba})

due to our induction step. The problem is solved.

Exercise 12. Give a regular expression that describes the language containing exactly those words that contain three consecutive a's over the alphabet {a,b}.

Exercise 13. Give a regular expression that describes the language containing exactly those words that do not contain two consecutive a's (over the alphabet {a,b}).

Exercise 14. Give a regular grammar that generates the language 0*(1+22)(2*+00).

Exercise 15. Give a regular grammar that generates the language 0+1(1+0)*.

Exercise 16. Give a regular grammar that generates the language (a+bb(b+(cc)*))*(ababa+c*).