2. fejezet - Regular Languages and Finite Automata

Tartalom

2.1. Regular Grammars
2.2. Regular Expressions
2.3. Finite Automata as Language Recognizers
2.3.1. Synthesis and Analysis of Finite Automata
2.3.2. The Word Problem
2.4. Properties of Regular Languages
2.4.1. Closure Properties
2.4.2. Myhill-Nerode theorem
2.5. Finite Transducers
2.5.1. Mealy Automata
2.5.2. Moore Automata
2.5.3. Automata Mappings

Summary of the chapter: In this chapter, we deal with the simplest languages of the Chomsky hierarchy, i.e., the regular languages. We show that they can be characterized by regular expressions. Another description of this class of languages can be given by finite automata: both the class of deterministic finite automata and the class of nondeterministic finite automata accept this class. The word problem (parsing) can be solved in real-time for this class by the deterministic finite automata. This class is closed under regular and under set-theoretical operations. This class also has a characterization in terms of analyzing the classes of possible continuations of the words (Myhill-Nerode theorem). We also present Mealy-type and Moore-type transducers: finite transducers are finite automata with output.

2.1. Regular Grammars

In order to be comprehensive we present the definition of regular grammars here again.

Definition (Regular grammars). A grammar G=(N,T,S,P) is regular if each of its productions has one of the following forms: A → u, A → uB, where A,B∈ N, u∈ T*. The languages that can be generated by regular grammars are the regular languages (they are also called type 3 languages of the Chomsky hierarchy).

Animation 2. presents an example for a derivation in a regular grammar.

Animation 2.

We note here that the grammars and languages of the definition above are commonly referred to as right-linear grammars and languages, and regular grammars and languages are defined in a more restricted way:

Definition 12. (Alternative definition of regular grammars). A grammar G = ( N, T, S, P ) is regular if each of its productions has one of the following forms: A → a, A → aB, S → λ, where A,B N, a T. The languages that can be generated by these grammars are the regular languages.

Now we show that the two definitions are equivalent in the sense that they define the same class of languages.

Theorem 2. The language classes defined by our original definition and by the alternative definition coincide.

Proof. The proof consists of two parts: we need to show that languages generated by grammars of one definition can also be generated by grammars of the other definition.

It is clear the grammars satisfying the alternative definition satisfy our original definition at the same time, therefore, every language that can be generated by the grammars of the alternative definition can also be generated by grammars of the original definition.

Let us consider the other direction. Let a grammar G = ( N, T, S, P ) may contain rules of types Au and AuB (A,BN, uT*). Then, we give a grammar G' = ( N', T, S', P' ) such that it may only contain rules of the forms Aa, AaB, S' → λ (where A,BN', aT) and it generates the same language as G, i.e., L (G) = L (G').

First we obtain a grammar G'' such that L (G) = L (G'') and G'' = ( N'', T, S, P'' ) may contain only rules of the following forms: Aa, AaB, AB, A → λ (where A,BN'', aT). Let us check every rule in P: if it has one of the forms above, then we copy it to the set P'', else we will do the following:

  • if the rule is of the form Aa1... ak for k > 1, where aiT, i ∈{1,...,k}, then let the new nonterminals X1,..., Xk-1 be introduced (new set for every rule) and added to the set N'' and instead of the actual rule the next set of rules is added to P'': Aa1X1, X1a2X2, ..., Xk-2ak-1Xk-1, Xk-1ak.

  • if the rule is of the form Aa1... akB for k > 1, where aiT, i ∈ {1,...,k}, then let the new nonterminals X1,..., Xk-1 be introduced (new set for every rule) and put to the set N'', and instead of the actual rule the next set of rules is added to P'': Aa1X1, X1a2X2, ..., Xk-2ak-1Xk-1, Xk-1akB.

When every rule is analyzed (and possibly replaced by a set of rules) we have grammar G''. It is clear that the set of productions P'' of G'' may contain only rules of the forms Aa, AaB, Aa, A → λ (where A,BN'', aT), since we have copied only rules from P that have these forms, and all the rules that are added to the set of productions P'' by replacing a rule are of the forms AaB, Aa (where A,BN'', aT). Moreover, exactly the same words can be derived in G'' and in G. The derivation graphs in the two grammars can be mapped in a bijective way. When a derivation step is applied in G with a rule Aa1... that is not in P'', then the rules must be used in G'' that are used to replace the rule Aa1...: applying the first added rule first Aa1X1, then there is only one rule that can be applied in G'', since there is only one rule added with X1 in its left hand side... therefore, one needs to apply all the rules of the chain that was used to replace the original rule Aa1.... Then, if there is a nonterminal B at the end of the rule Aa1..., then it is located at the end of the last applied rule of the chain, and thus the derivation can be continued in the same way in both grammars. The other way around, if we use a newly introduced rule in the derivation in G'', then we must use all the rules of the chain, and also we can use the original rule that was replaced by this chain of rules in the grammar G. In this way, there is a one-to-one correspondence in the completed derivations of the two grammars. (See Figure 2.1. for an example of replacing a long rule by a sequence of shorter rules.)

2.1. ábra - In derivations the rules with long right hand side are replaced by chains of shorter rules, resulting binary derivation trees in the new grammar.

In derivations the rules with long right hand side are replaced by chains of shorter rules, resulting binary derivation trees in the new grammar.

Now we may have some rules in P'' that do not satisfy the alternative definition. The form of these rules can only be AB and C → λ (where A,B,CN'', CS). The latter types of rules can be eliminated by the Empty-word lemma (see Theorem 1.). Therefore, we can assume that we have a grammar G''' = ( N''', T, S', P''' ) such that L (G''') = L (G) and P''' may contain only rules of the forms AaB, AB, Aa, S' → λ (where A,BN''', aT and in case S' → λ ∈ P''' the start symbol S' does not occur on the right hand side of any of the rules of P'''). Let us define the following sets of nonterminals:

  • let U1 (A) = {A},

  • let Ui+1 (A) = Ui (A) ∪ {BN'''∣∃ CUi (A) such that CBP'''}, for i > 1.

Since N''' is finite there is a minimal index k such that Uk(A) = Uk+1(A). Let U(A) denote the set Uk(A) with the above property. In this way, U(A) contains exactly those nonterminals that can be derived from A by using rules only of the form BC (where B,CN'''). We need to replace the parts A* Br of the derivations in G''' by a direct derivation step in our new grammar, therefore, let G' = ( N''', T, S', P' ), where P' = {Ar∣∃ BN''' such that BrP''', rN''' and BU(A)}. Then G' fulfills the alternative definition, moreover, it generates the same language as G''' and G.

QED.

Further we will call a grammar a regular grammar in normal form if it satisfies our alternative definitions. In these grammars the structures of the rules are more restricted: if we do not derive the empty word, then we can use only rules that have exactly one terminal in their right hand side.

Example 19. Let

G = ({S,A,B},{0,1,2},S,

{  S → 010B, 
   AB,
   A → 2021, 
   B → 2A, 
   BS,
   B → λ 
}). 

Give a grammar that is equivalent to G and is in normal form.

Solution:

Let us exclude first the rules that contain more than one terminal symbols. Such rules are S → 010B and A → 2021. Let us substitute them by the sets

{S → 0X1, X1 → 1X2, X2 → 0B}

and

{A → 2X3, X3 → 0X4, X4 → 2X5, X5 → 1}

of rules, respectively, where the newly introduced nonterminals are {X1, X2} and {X3, X4, X5}, respectively. The obtained grammar is

G'' = ({S, A, B, X1, X2, X3, X4, X5},{0,1,2},S,

{  S → 0X1,
   X1 → 1X2, 
   X2 → 0B,
   AB,
   A → 2X3,
   X3 → 0X4, 
   X4 → 2X5,
   X5 → 1,
   B → 2A, 
   BS, 
   B → λ
 ).

Now, by applying the Empty-word lemma, we can exclude rule B → λ (the empty word can be derived from the nonterminals B and A in our example) and by doing so we obtain grammar

G''' = ({S, A, B, X1, X2, X3, X4, X5}, {0,1,2}, S,

{  S → 0X1,
   X1 → 1X2,
   X2 → 0B,
   X2 → 0,
   AB,
   A → 2X3,
   X3 → 0X4,
   X4 → 2X5,
   X5 → 1,
   B → 2A,
   B → 2,
   BS
}).

Now we are excluding the chain rules A B, B S. To do this step, first, we obtain the following sets:

U0 (S) = {S}  U0 (A) = {A}  U0 (B) = {B}
U1 (S) = {S} = U(S)  U1 (A) = {A,B}  U1 (B) = {B,S}
 U2 (A) = {A,B,S}  U2 (B) = {B,S} = U(B)
 U3 (A) = {A,B,S} = U(A)  

Actually for those nonterminals that are not appearing in chain rules these sets are the trivial sets, e.g., U (X1) = U0 (X1) = {X1}. Thus, finally, we obtain grammar

G' = ({S, A, B, X1, X2, X3, X4, X5}, {0,1,2}, S,

{  S → 0X1,
   A → 0X1,
   B → 0X1,
   X1 → 1X2, 
   X2 → 0B,
   X2 → 0,
   A → 2X3,
   X3 → 0X4,
   X4 → 2X5,
   X5 → 1,
   B → 2A,
   A → 2A,
   B → 2,
   A → 2 
}).

Since our transformations preserve the generated language, every obtained grammar (G'', G''' and also G') is equivalent to G. Moreover, G' is in normal form. Thus the problem is solved.

Exercise 8. Let

G = ({S, A, B, C},{a,b},S,

{  SabA, 
   SA,
   AB, 
   Babab,
   BaA,
   BaaC, 
   B → λ, 
   CaaS 
}). 

Give a grammar that is equivalent to G and is in normal form.

Exercise 9. Let

G = ({S, A, B, C},{a,b,c},S,

{  SA,
   SB,
   AaaB,
   BA,
   BacS,
   BC, 
   Cc,
   C → λ 
}). 

Give a grammar that is equivalent to G and is in normal form.

Exercise 10. Let

G = ({S, A, B, C, D},{a,b,c},S,

{  SaA,
   SbB,
   AB,
   BA,
   BccccC,
   BacbcB,
   CcaacA, 
   Ccba
}). 

Give a grammar that is equivalent to G and is in normal form.

Exercise 11. Let

G = ({S, A, B, C, D},{1,2,3,4},S,

{  S → 11A, 
   S → 12B,
   AB,
   BC,
   B → 14,
   B → 4431,
   C → 3D,
   D → 233C 
}). 

Give a grammar that is equivalent to G and is in normal form.