Chapter 1. Automata and Formal Languages

Automata and formal languages play an important role in the design and implementation of compilers. In the first section grammars and formal languages are defined. The different grammars and languages are discussed based on the Chomsky hierarchy. In the second section we deal in detail with finite automata and the languages accepted by them, while in the third section pushdown automata and the corresponding accepted languages are discussed. Finally, references to a rich bibliography are given.

1.1. Languages and grammars

A finite and nonempty set of symbols is called an alphabet. The elements of an alphabet are called letters, but sometimes they are also called symbols.

Words are composed of the letters of an alphabet. If $a_1, a_2, \ldots, a_n \in \Sigma$ ($n \ge 0$), then $a_1a_2\ldots a_n$ is a word over the alphabet $\Sigma$ (the letters are not necessarily distinct). The number of letters of a word, counted with multiplicities, is the length of the word. If $u = a_1a_2\ldots a_n$, then the length of $u$ is $|u| = n$. If $n = 0$, then the word is the empty word, which will be denoted by $\varepsilon$ (sometimes $\lambda$ in other books). The set of words over the alphabet $\Sigma$ will be denoted by $\Sigma^*$:
$\Sigma^* = \{ a_1a_2\ldots a_n \mid a_1, a_2, \ldots, a_n \in \Sigma,\ n \ge 0 \}.$

For the set of nonempty words over $\Sigma$ the notation $\Sigma^+$ will be used. The set of words of length $n$ over $\Sigma$ will be denoted by $\Sigma^n$, and $\Sigma^0 = \{\varepsilon\}$. Then
$\Sigma^* = \Sigma^0 \cup \Sigma^1 \cup \cdots \cup \Sigma^n \cup \cdots$ and $\Sigma^+ = \Sigma^1 \cup \Sigma^2 \cup \cdots \cup \Sigma^n \cup \cdots.$

The words $u = a_1a_2\ldots a_n$ and $v = b_1b_2\ldots b_m$ are equal (i.e. $u = v$) if $n = m$ and $a_i = b_i$ for $i = 1, 2, \ldots, n$.

We define on $\Sigma^*$ the binary operation called concatenation. The concatenation (or product) of the words $u = a_1a_2\ldots a_n$ and $v = b_1b_2\ldots b_m$ is the word $uv = a_1a_2\ldots a_nb_1b_2\ldots b_m$. It is clear that $|uv| = |u| + |v|$. This operation is associative but not commutative. Its neutral element is $\varepsilon$, because $u\varepsilon = \varepsilon u = u$ for all $u \in \Sigma^*$. $\Sigma^*$ with the concatenation is a monoid.

We also introduce the power operation. If $u \in \Sigma^*$, then $u^0 = \varepsilon$ and $u^n = u^{n-1}u$ for $n \ge 1$. The reversal (or mirror image) of the word $u = a_1a_2\ldots a_n$ is $u^{-1} = a_na_{n-1}\ldots a_1$. The reversal of $u$ is sometimes denoted by $u^R$ or $\tilde{u}$. It is clear that $\left(u^{-1}\right)^{-1} = u$ and $(uv)^{-1} = v^{-1}u^{-1}$.

Word $u$ is a prefix of the word $v$ if there exists a word $z$ such that $v = uz$. If $z \ne \varepsilon$, then $u$ is a proper prefix of $v$. Similarly, $u$ is a suffix of $v$ if there exists a word $x$ such that $v = xu$; the proper suffix can be defined analogously. Word $u$ is a subword of the word $v$ if there are words $p$ and $q$ such that $v = puq$. If $pq \ne \varepsilon$, then $u$ is a proper subword.
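
The basic word operations above can be tried out directly on Python strings; the following is a small illustrative sketch (the helper names are our own, not taken from the text).

    # Words over an alphabet are modelled as Python strings; the empty word is "".

    def power(u: str, n: int) -> str:
        """u^0 is the empty word, u^n = u^(n-1) u."""
        return u * n

    def reversal(u: str) -> str:
        """Mirror image of u: the letters in reverse order."""
        return u[::-1]

    def is_prefix(u: str, v: str) -> bool:
        """u is a prefix of v if v = uz for some word z."""
        return v.startswith(u)

    def is_suffix(u: str, v: str) -> bool:
        """u is a suffix of v if v = xu for some word x."""
        return v.endswith(u)

    def is_subword(u: str, v: str) -> bool:
        """u is a subword of v if v = puq for some words p, q."""
        return u in v

    if __name__ == "__main__":
        u, v = "ab", "ba"
        assert u + v == "abba"                                  # concatenation
        assert len(u + v) == len(u) + len(v)                    # |uv| = |u| + |v|
        assert reversal(u + v) == reversal(v) + reversal(u)     # (uv)^-1 = v^-1 u^-1
        assert power(u, 3) == "ababab"
        assert is_prefix("ab", "abba") and is_suffix("ba", "abba")
        assert is_subword("bb", "abba")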

A subset $L$ of $\Sigma^*$ is called a language over the alphabet $\Sigma$. Sometimes this is called a formal language because the words are considered here without any meaning. Note that $\emptyset$ is the empty language, while $\{\varepsilon\}$ is a language which contains only the empty word.

1.1.1. Operations on languages

If $L$, $L_1$, $L_2$ are languages over $\Sigma$, we define the following operations:

union

$L_1 \cup L_2 = \{ u \mid u \in L_1 \text{ or } u \in L_2 \}$,

intersection

$L_1 \cap L_2 = \{ u \mid u \in L_1 \text{ and } u \in L_2 \}$,

difference

$L_1 \setminus L_2 = \{ u \mid u \in L_1 \text{ and } u \notin L_2 \}$,

complement

$\overline{L_1} = \Sigma^* \setminus L_1$,

multiplication

$L_1L_2 = \{ uv \mid u \in L_1,\ v \in L_2 \}$,

power

$L^0 = \{\varepsilon\}$, and $L^n = L^{n-1}L$ if $n \ge 1$,

iteration or star operation

$L^* = \bigcup_{i=0}^{\infty} L^i = L^0 \cup L \cup L^2 \cup \cdots \cup L^n \cup \cdots$,

mirror

$L^{-1} = \{ u^{-1} \mid u \in L \}$.

We will also use the notation

$L^+ = \bigcup_{i=1}^{\infty} L^i = L \cup L^2 \cup \cdots \cup L^n \cup \cdots.$

The union, product and iteration are called regular operations.
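
For finite languages the operations above can be computed directly; the sketch below represents a language as a Python set of words and computes only a bounded approximation of the star operation, since $L^*$ is infinite in general (all names are our own).

    from itertools import product as cartesian

    # A finite language is modelled as a set of words (strings).

    def concat(l1, l2):
        """Multiplication: L1 L2 = { uv : u in L1, v in L2 }."""
        return {u + v for u, v in cartesian(l1, l2)}

    def power(l, n):
        """L^0 = {eps}, L^n = L^(n-1) L."""
        result = {""}
        for _ in range(n):
            result = concat(result, l)
        return result

    def star_up_to(l, n):
        """Approximation of the star operation: L^0 U L^1 U ... U L^n."""
        result = set()
        for i in range(n + 1):
            result |= power(l, i)
        return result

    if __name__ == "__main__":
        l1, l2 = {"a", "ab"}, {"b"}
        print(l1 | l2)                 # union
        print(l1 & l2)                 # intersection
        print(l1 - l2)                 # difference
        print(concat(l1, l2))          # {'ab', 'abb'}
        print(star_up_to(l2, 3))       # {'', 'b', 'bb', 'bbb'}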

1.1.2. Specifying languages

Languages can be specified in several ways. For example a language can be specified using

1) the enumeration of its words,

2) a property, such that all words of the language have this property but other words do not,

3) a grammar.

1.1.2.1.  Specifying languages by listing their elements.

For example the following are languages

Even if we cannot enumerate the elements of an infinite set, infinite languages can be specified by enumeration if, after enumerating the first few elements, we can continue the enumeration using a rule. The following is such a language:

1.1.2.2.  Specifying languages by properties.

The following sets are languages

where $n_a(u)$ denotes the number of letters $a$ in the word $u$ and $n_b(u)$ the number of letters $b$.

1.1.2.3.  Specifying languages by grammars.

We now define the generative grammar, or grammar for short.

Definition 1.1 A grammar is an ordered quadruple $G = (N, T, P, S)$, where

$N$ is the alphabet of variables (or nonterminal symbols),

$T$ is the alphabet of terminal symbols, where $N \cap T = \emptyset$,

$P \subseteq (N \cup T)^* \times (N \cup T)^*$ is a finite set, that is, $P$ is the finite set of productions of the form $(u, v)$, where $u, v \in (N \cup T)^*$ and $u$ contains at least one nonterminal symbol,

$S \in N$ is the start symbol.

Remarks. Instead of the notation $(u, v)$ the notation $u \to v$ is often used.

In the production $u \to v$ the word $u$ is called the left-hand side of the production, while $v$ is the right-hand side. If a grammar has several productions with the same left-hand side, $u \to v_1$, $u \to v_2$, $\ldots$, $u \to v_r$, then these productions can be written shortly as $u \to v_1 \mid v_2 \mid \cdots \mid v_r$.

We define on the set $(N \cup T)^*$ the relation $\Longrightarrow$, called direct derivation: $u \Longrightarrow v$ if $u = p_1pp_2$, $v = p_1qp_2$ and $(p, q) \in P$.

In fact, we replace in $u$ an occurrence of the subword $p$ by $q$ and we get $v$. Other notations for the same relation also occur in the literature.

If we want to emphasize the grammar $G$ used, then the notation $\Longrightarrow$ can be replaced by $\underset{G}{\Longrightarrow}$. Relation $\overset{*}{\Longrightarrow}$ is the reflexive and transitive closure of $\Longrightarrow$, while $\overset{+}{\Longrightarrow}$ denotes its transitive closure. Relation $\overset{*}{\Longrightarrow}$ is called a derivation.

From the definition of a reflexive and transitive relation we can deduce the following: $u \overset{*}{\Longrightarrow} v$ if there exist words $w_0, w_1, \ldots, w_n \in (N \cup T)^*$ such that $u = w_0$, $v = w_n$ and $w_{i-1} \Longrightarrow w_i$ for $i = 1, 2, \ldots, n$. This can be written shortly as $u = w_0 \Longrightarrow w_1 \Longrightarrow \cdots \Longrightarrow w_n = v$. If $n = 0$, then $u = v$. In the same way we can define the relation $u \overset{+}{\Longrightarrow} v$, except that $n \ge 1$ always, so at least one direct derivation will be used.

Definition 1.2 The language generated by grammar $G = (N, T, P, S)$ is the set
$L(G) = \{ u \in T^* \mid S \overset{*}{\Longrightarrow} u \}.$

So $L(G)$ contains all words over the alphabet $T$ which can be derived from the start symbol $S$ using the productions from $P$.
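
The definition of $L(G)$ suggests a naive way to list the short words of a generated language: starting from $S$, repeatedly apply productions to sentential forms. The sketch below performs a bounded breadth-first search; the grammar representation (productions as pairs of strings, nonterminals as single upper-case letters) and the pruning rule are conventions of this sketch only, and the pruning is safe only for grammars whose productions never shorten a sentential form.

    from collections import deque

    def derive_step(form, productions):
        """All sentential forms directly derivable (in one step) from `form`."""
        result = set()
        for lhs, rhs in productions:
            start = form.find(lhs)
            while start != -1:
                result.add(form[:start] + rhs + form[start + len(lhs):])
                start = form.find(lhs, start + 1)
        return result

    def generated_words(start, productions, nonterminals, max_len):
        """Words of L(G) of length at most max_len, found by bounded search."""
        words, seen, queue = set(), {start}, deque([start])
        while queue:
            form = queue.popleft()
            for nxt in derive_step(form, productions):
                if len(nxt) > max_len or nxt in seen:   # prune long forms (see above)
                    continue
                seen.add(nxt)
                if any(x in nonterminals for x in nxt):
                    queue.append(nxt)
                else:
                    words.add(nxt)
        return words

    if __name__ == "__main__":
        # hypothetical grammar with productions S -> aSb and S -> ab,
        # generating the language { a^n b^n : n >= 1 }
        P = [("S", "aSb"), ("S", "ab")]
        print(sorted(generated_words("S", P, {"S"}, 8), key=len))
        # ['ab', 'aabb', 'aaabbb', 'aaaabbbb']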

Example 1.1 Let where

It is easy to see this, because

where up to the last but one replacement the first production was used, while at the last replacement the second production was used. This derivation can also be written in short form. Therefore all such words can be derived from the start symbol, and no other words can be derived from it.

Definition 1.3 Two grammars $G_1$ and $G_2$ are equivalent, and this is denoted by $G_1 \cong G_2$, if $L(G_1) = L(G_2)$.

Example 1.2 The following two grammars are equivalent because both of them generate the language .

, where

,

where

First let us prove by mathematical induction that for If then

The inductive hypothesis is We use production , then times production , and then production , afterwards again times production . Therefore

If now we use production we get for , but by the production , so for any . We have to prove in addition that, using the productions of the grammar, no words other than those of the given form can be derived. It is easy to see that a successful derivation (which ends in a word containing only terminals) can be obtained only in the presented way.

Similarly for

Here the productions were used in the following order: ( times), , ( times), , ( times), , ( times). But So , . It is also easy to see that other words cannot be derived using grammar .

The grammars and are not equivalent because .

Theorem 1.4 Not all languages can be generated by grammars.

Proof. We encode grammars for the proof as words over the alphabet $\{0, 1\}$. For a given grammar $G = (N, T, P, S)$, list the elements of $N$ and of $T$ in some fixed order. The encoding is the following:

the code of is the code of is

In the code of the grammar the letters are separated by 000, the code of the arrow is 0000, and the productions are separated by 00000.

It is enough, of course, to encode the productions only. For example, consider the grammar

.

The code of is 10101, the code of is 1001001, the code of is 10011001. The code of the grammar is

From this encoding it follows that the grammars with terminal alphabet $T$ can be enumerated as $G_1, G_2, \ldots$, and the set of these grammars is a denumerably infinite set.

Footnote. Let us suppose that there is a linear order on the letters of the alphabet. The words which are codes of grammars can be enumerated by ordering them first by their lengths and, among words of equal length, alphabetically, using the order of their letters. But we can equally use the lexicographic order, which means that $u < v$ ($u$ is before $v$) if $u$ is a proper prefix of $v$ or there exist decompositions $u = xay$ and $v = xbz$, where $x$, $y$, $z$ are words and $a$, $b$ are letters with $a < b$.

Consider now the set of all languages over the alphabet $T$, denoted by $\mathcal{L}$, that is, $\mathcal{L} = \{ L \mid L \subseteq T^* \}$. The set $T^*$ is denumerable because its words can be ordered; let this order be $u_1, u_2, \ldots, u_n, \ldots$. We associate to each language $L \in \mathcal{L}$ an infinite binary sequence $b_1b_2\ldots b_n\ldots$ in the following way: $b_i = 1$ if $u_i \in L$, and $b_i = 0$ otherwise ($i = 1, 2, \ldots$).

It is easy to see that the set of all such binary sequences is not denumerable, because each sequence can be considered as a positive number less than 1 written in binary representation (the binary point is considered to be before the first digit). Conversely, to each positive number less than 1 in binary representation a binary sequence can be associated. So the cardinality of the set of infinite binary sequences is equal to the cardinality of the interval $[0, 1]$, which is of continuum power. Therefore the set $\mathcal{L}$ is of continuum cardinality. Now associate to each grammar with terminal alphabet $T$ the language it generates over $T$. Since the cardinality of the set of grammars is denumerable, there must exist a language in $\mathcal{L}$ without an associated grammar, that is, a language which cannot be generated by any grammar.

1.1.3. Chomsky hierarchy of grammars and languages

By putting some restrictions on the form of productions, four types of grammars can be distinguished.

Definition 1.5 Define for a grammar $G = (N, T, P, S)$ the following four types.

A grammar $G$ is of type 0 (phrase-structure grammar) if there are no restrictions on its productions.

A grammar $G$ is of type 1 (context-sensitive grammar) if all of its productions are of the form $\alpha A\beta \to \alpha\gamma\beta$, where $A \in N$, $\alpha, \beta \in (N \cup T)^*$, $\gamma \in (N \cup T)^+$. A production of the form $S \to \varepsilon$ can also be accepted if the start symbol $S$ does not occur in the right-hand side of any production.

A grammar $G$ is of type 2 (context-free grammar) if all of its productions are of the form $A \to \gamma$, where $A \in N$, $\gamma \in (N \cup T)^+$. A production of the form $S \to \varepsilon$ can also be accepted if the start symbol $S$ does not occur in the right-hand side of any production.

A grammar $G$ is of type 3 (regular grammar) if its productions are of the form $A \to aB$ or $A \to a$, where $a \in T$ and $A, B \in N$. A production of the form $S \to \varepsilon$ can also be accepted if the start symbol $S$ does not occur in the right-hand side of any production.

If a grammar $G$ is of type $i$, then language $L(G)$ is also of type $i$.

This classification was introduced by Noam Chomsky.

A language $L$ is of type $i$ ($i = 0, 1, 2, 3$) if there exists a grammar $G$ of type $i$ which generates the language $L$, so $L = L(G)$.

Denote by $\mathcal{L}_i$ ($i = 0, 1, 2, 3$) the class of the languages of type $i$. It can be proved that

$\mathcal{L}_0 \supset \mathcal{L}_1 \supset \mathcal{L}_2 \supset \mathcal{L}_3.$

By the definition of the different types of languages, the inclusions ($\supseteq$) are evident, but the strict inclusions ($\supset$) must be proved.
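
The four conditions of Definition 1.5 can be checked mechanically from the form of the productions. The following sketch classifies a grammar given as sets of single-character symbols and productions as pairs of strings; for type 1 it only tests the length condition that every context-sensitive production satisfies, which is an approximation chosen for brevity in this sketch.

    def grammar_type(nonterminals, terminals, productions, start):
        """Classify a grammar as 3, 2, 1 or 0 by the form of its productions."""
        rhs_symbols = {x for _, rhs in productions for x in rhs}
        # S -> eps is tolerated for types 1-3 if S occurs in no right-hand side
        eps_ok = start not in rhs_symbols

        def allowed_eps(lhs, rhs):
            return lhs == start and rhs == "" and eps_ok

        def is_type3(lhs, rhs):
            return allowed_eps(lhs, rhs) or (
                lhs in nonterminals and len(rhs) in (1, 2)
                and rhs[0] in terminals
                and (len(rhs) == 1 or rhs[1] in nonterminals))

        def is_type2(lhs, rhs):
            return allowed_eps(lhs, rhs) or (lhs in nonterminals and len(rhs) >= 1)

        def is_type1(lhs, rhs):
            # approximation: a nonterminal occurs in lhs and |lhs| <= |rhs|
            return allowed_eps(lhs, rhs) or (
                any(x in nonterminals for x in lhs) and 1 <= len(lhs) <= len(rhs))

        if all(is_type3(l, r) for l, r in productions):
            return 3
        if all(is_type2(l, r) for l, r in productions):
            return 2
        if all(is_type1(l, r) for l, r in productions):
            return 1
        return 0

    if __name__ == "__main__":
        # hypothetical regular grammar with productions S -> aS and S -> a
        print(grammar_type({"S"}, {"a"}, [("S", "aS"), ("S", "a")], "S"))   # 3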

Example 1.3 We give an example of a context-sensitive, a context-free and a regular grammar, respectively.

Context-sensitive grammar. , where

Elements of are:

Language contains words of the form with and .

Context-free grammar. , where

Elements of are:

Language contains algebraic expressions which can be correctly built using letter , operators and and brackets.

Regular grammar. , where

Elements of are:

Language contains words over the alphabet with at least two letters at the beginning.

It is easy to prove that any finite language is regular. Productions are constructed so as to generate all words of the language. For example, if $a_1a_2\ldots a_n$ is a word of the language, then we introduce the productions $S \to a_1A_1$, $A_1 \to a_2A_2$, $\ldots$, $A_{n-2} \to a_{n-1}A_{n-1}$, $A_{n-1} \to a_n$, where $S$ is the start symbol of the grammar and $A_1, A_2, \ldots, A_{n-1}$ are distinct nonterminals. We define such productions for all words of the language, using different nonterminals for different words, except for the start symbol $S$. If the empty word is also an element of the language, then the production $S \to \varepsilon$ is also considered.

The empty set is also a regular language, because, for example, the regular grammar $G = (\{S\}, \{a\}, \{S \to aS\}, S)$ generates it: no word consisting only of terminals can be derived from $S$.
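
The construction for a finite language can be carried out mechanically: for every word a fresh chain of nonterminals is created. The sketch below builds such a set of productions (the generated nonterminal names A1, A2, ... are an arbitrary choice of this sketch).

    def finite_language_grammar(words, start="S"):
        """Productions of a regular grammar generating the given finite language."""
        productions, nonterminals, counter = [], {start}, 0
        for w in words:
            if w == "":
                productions.append((start, ""))          # S -> eps for the empty word
                continue
            prev = start
            for letter in w[:-1]:
                counter += 1
                fresh = "A" + str(counter)               # a new nonterminal for this word
                nonterminals.add(fresh)
                productions.append((prev, letter + fresh))
                prev = fresh
            productions.append((prev, w[-1]))            # the last letter ends the chain
        return nonterminals, productions

    if __name__ == "__main__":
        N, P = finite_language_grammar({"ab", "ba", "b"})
        for lhs, rhs in P:
            print(lhs, "->", rhs if rhs else "eps")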

1.1.3.1.  Eliminating unit productions.

A production of the form $A \to B$, where $A, B \in N$, is called a unit production. Unit productions can be eliminated from a grammar in such a way that the new grammar is of the same type and is equivalent to the original one.

Let $G = (N, T, P, S)$ be a grammar with unit productions. We define an equivalent grammar $G' = (N, T, P', S)$ without unit productions. The following algorithm constructs the equivalent grammar.

Eliminate-Unit-Productions( $G$ )

  1  while $P$ can be extended:
         if the unit productions $A \to B$ and $B \to C$ are in $P$,
             put also the unit production $A \to C$ in $P$
  2  if the unit production $A \to B$ and the production $B \to \alpha$ ($\alpha \notin N$) are in $P$,
         put also the production $A \to \alpha$ in $P$
  3  let $P'$ be the set of productions of $P$ except the unit productions
  4  $G' \leftarrow (N, T, P', S)$
     return $G'$

Clearly, $G$ and $G'$ are equivalent. If $G$ is of type $i$, then $G'$ is also of type $i$.
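
A direct transcription of ELIMINATE-UNIT-PRODUCTIONS into Python might look as follows; productions are pairs (A, alpha) with a single nonterminal on the left (as in grammars of type 2 and 3), and the representation is this sketch's own.

    def eliminate_unit_productions(nonterminals, productions):
        """Return an equivalent set of productions without unit productions A -> B."""
        prods = set(productions)
        # step 1: close the set of unit productions transitively
        # (A -> B and B -> C give A -> C)
        while True:
            new = {(a, c) for a, b in prods if b in nonterminals
                          for b2, c in prods if b2 == b and c in nonterminals}
            if new <= prods:
                break
            prods |= new
        # step 2: for a unit production A -> B and a non-unit production B -> alpha,
        # add A -> alpha; finally keep only the non-unit productions
        result = set()
        for a, b in prods:
            if b in nonterminals:                        # a unit production
                for b2, alpha in prods:
                    if b2 == b and alpha not in nonterminals:
                        result.add((a, alpha))
            else:                                        # a non-unit production is kept
                result.add((a, b))
        return result

    if __name__ == "__main__":
        # hypothetical grammar with productions S -> A | ab, A -> B, B -> b
        N = {"S", "A", "B"}
        P = {("S", "A"), ("S", "ab"), ("A", "B"), ("B", "b")}
        for lhs, rhs in sorted(eliminate_unit_productions(N, P)):
            print(lhs, "->", rhs)
        # A -> b, B -> b, S -> ab, S -> b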

Example 1.4 Apply the above algorithm to the grammar , where  contains

Using the first step of the algorithm, we get the following new unit productions:

            (because of and ),

            (because of and ),

            (because of and ),

            (because of and ),

            (because of and ),

            (because of and ).

In the second step of the algorithm, only the productions with or on the right-hand side will be considered, since only the productions , and can be used (the other productions are all unit productions). We get the following new productions:

            (because of and ),

              (because of and ),

            (because of and ),

            (because of and ),

            (because of and ).

The new grammar will have the productions:

1.1.3.2.  Grammars in normal forms.

A grammar is said to be a grammar in normal form if its productions have no terminal symbols on the left-hand side.

We need the following notions. For alphabets $\Sigma_1$ and $\Sigma_2$ a homomorphism is a function $h : \Sigma_1^* \to \Sigma_2^*$ for which $h(u_1u_2) = h(u_1)h(u_2)$ for all $u_1, u_2 \in \Sigma_1^*$. It is easy to see that for an arbitrary word $u = a_1a_2\ldots a_n \in \Sigma_1^*$ the value $h(u)$ is uniquely determined by the restriction of $h$ to $\Sigma_1$, because $h(u) = h(a_1)h(a_2)\ldots h(a_n)$.

If a homomorphism $h$ is a bijection, then $h$ is an isomorphism.
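
A homomorphism is thus determined by the images of the single letters; a minimal sketch of its extension to words (with an arbitrarily chosen letter map):

    def homomorphism(letter_images):
        """Extend a map defined on letters to a homomorphism on words."""
        def h(word):
            return "".join(letter_images[letter] for letter in word)
        return h

    if __name__ == "__main__":
        h = homomorphism({"a": "01", "b": "1"})     # the letter images are just an example
        assert h("ab") == h("a") + h("b") == "011"  # h(uv) = h(u) h(v)
        assert h("") == ""                          # the empty word is mapped to itself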

Theorem 1.6 To any grammar an equivalent grammar in normal form can be associated.

Proof. Grammars of type 2 and 3 have only a nonterminal on the left-hand side of each production, so they are already in normal form. The proof has to be done only for grammars of type 0 and 1.

Let $G = (N, T, P, S)$ be the original grammar, and define the grammar in normal form as $G' = (N', T, P', S)$.

Let $a_1, a_2, \ldots, a_k$ be those terminal symbols which occur on the left-hand side of productions. We introduce the new nonterminals $A_1, A_2, \ldots, A_k$. The following notation will be used: $T_1 = \{a_1, a_2, \ldots, a_k\}$, $T_2 = T \setminus T_1$, $N_1 = \{A_1, A_2, \ldots, A_k\}$ and $N' = N \cup N_1$.

Define the isomorphism $h : (N \cup T)^* \to (N' \cup T_2)^*$, where

$h(X) = X$ for $X \in N \cup T_2$, and $h(a_i) = A_i$ for $a_i \in T_1$, $i = 1, 2, \ldots, k$.

Define the set of productions $P'$ as

$P' = \{ h(\alpha) \to h(\beta) \mid \alpha \to \beta \in P \} \cup \{ A_i \to a_i \mid i = 1, 2, \ldots, k \}.$

In this case $\alpha \overset{*}{\underset{G}{\Longrightarrow}} \beta$ if and only if $h(\alpha) \overset{*}{\underset{G'}{\Longrightarrow}} h(\beta)$. From this the theorem immediately follows, because $S \overset{*}{\underset{G}{\Longrightarrow}} u \in T^*$ if and only if $S \overset{*}{\underset{G'}{\Longrightarrow}} h(u) \overset{*}{\underset{G'}{\Longrightarrow}} u$.

Example 1.5 Let , where contains

On the left-hand side of productions the terminals occur, therefore we consider the new nonterminals , and we also include in the new productions , and .

The terminals will be replaced by the corresponding nonterminals, and we get the set as

Let us see what words can be generated by this grammar. It is easy to see that because

, so

We prove, using mathematical induction, that for . For this is the case, as we have seen before. Continuing the derivation we get , and this is what we had to prove.

But So . These words can be generated also in .
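
The construction used in the proof of Theorem 1.6 (replace every terminal occurring on a left-hand side by a fresh nonterminal everywhere, and add a production rewriting the new nonterminal back to the terminal) can be sketched as follows. The representation and the assumption that the upper-case version of a terminal is a free nonterminal name are choices of this sketch only.

    def to_normal_form(nonterminals, terminals, productions):
        """Remove terminals from the left-hand sides of productions."""
        lhs_terminals = {x for lhs, _ in productions for x in lhs if x in terminals}
        image = {t: t.upper() for t in lhs_terminals}    # a_i is replaced by A_i

        def h(word):
            # the isomorphism h: replace the chosen terminals, keep all other symbols
            return "".join(image.get(x, x) for x in word)

        new_productions = {(h(lhs), h(rhs)) for lhs, rhs in productions}
        new_productions |= {(image[t], t) for t in lhs_terminals}    # A_i -> a_i
        return nonterminals | set(image.values()), new_productions

    if __name__ == "__main__":
        # hypothetical type-0 grammar with a terminal on a left-hand side:
        # productions aS -> Sa and S -> a
        N2, P2 = to_normal_form({"S"}, {"a"}, {("aS", "Sa"), ("S", "a")})
        for lhs, rhs in sorted(P2):
            print(lhs, "->", rhs)       # A -> a, AS -> SA, S -> A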

1.1.4. Extended grammars

In this subsection extended grammars of type 1, 2 and 3 will be presented.

Extended grammar of type 1. All productions are of the form $\alpha \to \beta$, where $|\alpha| \le |\beta|$, except possibly the production $S \to \varepsilon$.

Extended grammar of type 2. All productions are of the form $A \to \beta$, where $A \in N$, $\beta \in (N \cup T)^*$.

Extended grammar of type 3. All productions are of the form $A \to uB$ or $A \to u$, where $A, B \in N$, $u \in T^*$.

Theorem 1.7 To any extended grammar an equivalent grammar of the same type can be associated.

Proof. Denote by $G$ the extended grammar and by $G'$ the equivalent grammar of the same type to be constructed.

Type 1. Define the productions of grammar $G'$ by rewriting the productions $\alpha \to \beta$, where $|\alpha| \le |\beta|$, of the extended grammar in the form allowed in the case of grammar $G'$ in the following way.

Let $X_1X_2\ldots X_m \to Y_1Y_2\ldots Y_n$ ($m \le n$) be a production of $G$ which is not in the required form. Add to the set of productions of $G'$ the following productions, where $Z_1, Z_2, \ldots, Z_m$ are new nonterminals:

Furthermore, add to the set of productions of $G'$ without any modification those productions of $G$ which are already of the permitted form. The inclusion $L(G) \subseteq L(G')$ can be proved because each production of $G$ used in a derivation can be simulated by the productions of $G'$ obtained from it. Furthermore, since the productions of $G'$ can be used only in the prescribed order, we cannot obtain other words, so $L(G') \subseteq L(G)$ also holds.

Type 2. Let $G = (N, T, P, S)$. Productions of the form $A \to \varepsilon$ have to be eliminated; only $S \to \varepsilon$ may remain, and only if $S$ does not occur on the right-hand side of any production. For this, define the following sets:

$U_0 = \{ A \in N \mid A \to \varepsilon \in P \},$
$U_i = U_{i-1} \cup \{ A \in N \mid A \to w \in P,\ w \in U_{i-1}^* \}$ for $i \ge 1$.

Since $U_{i-1} \subseteq U_i$ for all $i \ge 1$ and $N$ is a finite set, there must exist an index $k$ for which $U_{k-1} = U_k$. Let us denote this set by $U$. It is easy to see that a nonterminal $A$ is in $U$ if and only if $A \overset{*}{\Longrightarrow} \varepsilon$. (In addition, $\varepsilon \in L(G)$ if and only if $S \in U$.)

We define the productions of $G'$ starting from the productions of $G$ in the following way. For each production $A \to \alpha$ with $\alpha \ne \varepsilon$ of $G$, add to the set of productions of $G'$ this production and also all productions which can be obtained from it by eliminating from $\alpha$ one or more nonterminals belonging to $U$, but only when the right-hand side does not become $\varepsilon$.

It is not difficult to see that this grammar generates the same language as $G$ does, except for the empty word $\varepsilon$. So, if $\varepsilon \notin L(G)$, then the proof is finished. But if $\varepsilon \in L(G)$, then there are two cases. If the start symbol $S$ does not occur in the right-hand side of any production, then by introducing the production $S \to \varepsilon$, grammar $G'$ will also generate the empty word. If $S$ occurs in the right-hand side of some production, then we introduce a new start symbol $S'$ and the new productions $S' \to S$ and $S' \to \varepsilon$. Now the empty word can also be generated by grammar $G'$.
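
The set $U$ and the expansion of the remaining productions can be computed as in the following sketch (context-free productions are pairs (A, alpha) of strings; the handling of the empty word itself is omitted, exactly as discussed above).

    from itertools import combinations

    def nullable_nonterminals(productions):
        """The set U of nonterminals A with A =>* eps (the limit of U_0, U_1, ...)."""
        u = {a for a, alpha in productions if alpha == ""}
        changed = True
        while changed:
            changed = False
            for a, alpha in productions:
                if a not in u and alpha and all(x in u for x in alpha):
                    u.add(a)
                    changed = True
        return u

    def eliminate_eps_productions(productions):
        """Productions of an equivalent grammar without A -> eps (empty word ignored)."""
        u = nullable_nonterminals(productions)
        result = set()
        for a, alpha in productions:
            if alpha == "":
                continue
            nullable_pos = [i for i, x in enumerate(alpha) if x in u]
            # leave out every possible subset of the nullable occurrences
            for k in range(len(nullable_pos) + 1):
                for drop in combinations(nullable_pos, k):
                    new_rhs = "".join(x for i, x in enumerate(alpha) if i not in drop)
                    if new_rhs:                  # the right-hand side must not become eps
                        result.add((a, new_rhs))
        return result

    if __name__ == "__main__":
        # hypothetical grammar: S -> aSb | AB, A -> eps, B -> b | eps
        P = {("S", "aSb"), ("S", "AB"), ("A", ""), ("B", "b"), ("B", "")}
        for lhs, rhs in sorted(eliminate_eps_productions(P)):
            print(lhs, "->", rhs)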

Type 3. First we apply to $G$ the procedure defined for grammars of type 2 to eliminate the productions of the form $A \to \varepsilon$. From the obtained grammar we eliminate the unit productions using the algorithm ELIMINATE-UNIT-PRODUCTIONS.

In the obtained grammar, for each production $A \to a_1a_2\ldots a_nB$, where $a_1, a_2, \ldots, a_n \in T$ ($n \ge 2$) and $B \in N$, add to the productions of $G'$ also the following ones:

$A \to a_1A_1$, $A_1 \to a_2A_2$, $\ldots$, $A_{n-1} \to a_nB$

(and analogously $A \to a_1A_1$, $A_1 \to a_2A_2$, $\ldots$, $A_{n-1} \to a_n$ for a production of the form $A \to a_1a_2\ldots a_n$),

where $A_1, A_2, \ldots, A_{n-1}$ are new nonterminals. It is easy to prove that the grammar $G'$ built in this way is equivalent to $G$.

Example 1.6 Let be an extended grammar of type 1, where , and contains the following productions:

The only production which is not context-sensitive is . Using the method given in the proof, we introduce the productions:

Now the grammar is context-sensitive, where the elements of are

It can be proved that .

Example 1.7 Let be an extended grammar of type 2, where contains:

Then , , . The productions of the new grammar are:

The original grammar also generates the empty word and, because the start symbol occurs in the right-hand side of a production, a new start symbol and two new productions will be defined: . The context-free grammar equivalent to the original grammar is with the productions:

Both of these grammars generate language .

Example 1.8 Let be the extended grammar of type 3 under examination, where :

First, we eliminate production . Since , the productions will be

The latter production (which is a unit production) can also be eliminated by replacing it with . Productions and have to be transformed. Since both productions have the same right-hand side, it is enough to introduce only one new nonterminal and to use the productions and instead of . Production will be replaced by . The new grammar is , where :

It can be proved that .

1.1.5. Closure properties in the Chomsky-classes

We will prove the following theorem, by which the Chomsky-classes of languages are closed under the regular operations, that is, the union and the product of two languages of type $i$ are also of type $i$, and the iteration of a language of type $i$ is also of type $i$ ($i = 0, 1, 2, 3$).

Theorem 1.8 The class $\mathcal{L}_i$ ($i = 0, 1, 2, 3$) of languages is closed under the regular operations.

Proof. For the proof we will use extended grammars. Consider the extended grammars $G_1 = (N_1, T_1, P_1, S_1)$ and $G_2 = (N_2, T_2, P_2, S_2)$ of type $i$ each. We can suppose that $N_1 \cap N_2 = \emptyset$.

Union. Let $G_{\cup} = (N_1 \cup N_2 \cup \{S\}, T_1 \cup T_2, P_1 \cup P_2 \cup \{S \to S_1, S \to S_2\}, S)$, where $S$ is a new start symbol.

We will show that $L(G_{\cup}) = L(G_1) \cup L(G_2)$. If $i = 0, 2, 3$, then from the assumption that $G_1$ and $G_2$ are of type $i$ it follows by definition that $G_{\cup}$ is also of type $i$. If $i = 1$ and one of the grammars generates the empty word, then we eliminate from $G_{\cup}$ the corresponding production $S_1 \to \varepsilon$ (possibly both it and $S_2 \to \varepsilon$) and replace it by the production $S \to \varepsilon$.

Product. Let $G_{\times} = (N_1 \cup N_2 \cup \{S\}, T_1 \cup T_2, P_1 \cup P_2 \cup \{S \to S_1S_2\}, S)$, where $S$ is a new start symbol.

We will show that $L(G_{\times}) = L(G_1)L(G_2)$. By definition, if $i = 0, 2$, then $G_{\times}$ will be of the same type. If $i = 1$ and there is a production $S_1 \to \varepsilon$ in $P_1$ but there is no production $S_2 \to \varepsilon$ in $P_2$, then the production $S_1 \to \varepsilon$ will be replaced by $S \to S_2$. We proceed in the same way in the symmetrical case. If there is a production $S_1 \to \varepsilon$ in $P_1$ and a production $S_2 \to \varepsilon$ in $P_2$, then they will be replaced by $S \to \varepsilon$.

In the case of regular grammars ($i = 3$), because $S \to S_1S_2$ is not a regular production, we need to use another grammar $G_{\times} = (N_1 \cup N_2, T_1 \cup T_2, P, S_1)$, where the difference between $P$ and $P_1 \cup P_2$ lies in that each production of the form $A \to u$ (with $u \in T_1^*$) of $P_1$ is replaced in $P$ by a production of the form $A \to uS_2$.

Iteration. Let $G_* = (N_1 \cup \{S\}, T_1, P, S)$, where $S$ is a new start symbol.

In the case of grammars of type 2 let $P = P_1 \cup \{S \to S_1S,\ S \to \varepsilon\}$. Then $G_*$ is also of type 2.

In the case of grammars of type 3, as in the case of the product, we will change the productions, that is, the difference between $P$ and $P_1$ lies in that each production of the form $A \to u$ (with $u \in T_1^*$) is replaced by $A \to uS$, the other productions are not changed, and the productions $S \to S_1$ and $S \to \varepsilon$ are added. Then $G_*$ will also be of type 3.

The productions given in the case of type 2 are not valid in the case of type 1, because when applying the production $S \to S_1S$ we can get derivations of the form , , , where can be a left-hand side of a production. In this case, replacing by its right-hand side in the derivation , we can generate a word which is not in the iterated language. To avoid such situations, first let us assume that the grammar is in normal form, i.e. the left-hand side of the productions does not contain terminals (see Section 1.1), and second we introduce a new nonterminal , so the set of nonterminals is now , and the productions are the following:

Now we can avoid situations in which the left-hand side of a production could extend over the boundaries of the words in a derivation because of the iteration. The above derivations can be used only by beginning with and getting the derivation . Here we cannot replace unless the last symbol in is a terminal symbol, and only after using a production of the form .

It is easy to show that $L(G_*) = \left(L(G_1)\right)^*$ for each type.
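
For grammars of type 2 the three constructions of the proof can be written down concretely; the resulting grammars are extended context-free grammars. The sketch below assumes the two grammars have disjoint nonterminal sets and that the chosen new start symbol is a fresh symbol; grammars are tuples (N, T, P, S) with productions as pairs of strings.

    def union_grammar(g1, g2, new_start="U"):
        """Grammar for L(G1) U L(G2): add the productions U -> S1 and U -> S2."""
        (n1, t1, p1, s1), (n2, t2, p2, s2) = g1, g2
        productions = set(p1) | set(p2) | {(new_start, s1), (new_start, s2)}
        return n1 | n2 | {new_start}, t1 | t2, productions, new_start

    def product_grammar(g1, g2, new_start="U"):
        """Grammar for L(G1) L(G2): add the production U -> S1 S2."""
        (n1, t1, p1, s1), (n2, t2, p2, s2) = g1, g2
        productions = set(p1) | set(p2) | {(new_start, s1 + s2)}
        return n1 | n2 | {new_start}, t1 | t2, productions, new_start

    def star_grammar(g1, new_start="U"):
        """Grammar for L(G1)*: add the productions U -> S1 U and U -> eps."""
        n1, t1, p1, s1 = g1
        productions = set(p1) | {(new_start, s1 + new_start), (new_start, "")}
        return n1 | {new_start}, t1, productions, new_start

    if __name__ == "__main__":
        # hypothetical grammars: G1 with S -> aSb | ab and G2 with A -> cA | c
        g1 = ({"S"}, {"a", "b"}, {("S", "aSb"), ("S", "ab")}, "S")
        g2 = ({"A"}, {"c"}, {("A", "cA"), ("A", "c")}, "A")
        n, t, p, s = product_grammar(g1, g2)
        print(s, sorted(p))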

Exercises

1.1-1 Give a grammar which generates language and determine its type.

1.1-2 Let be an extended context-free grammar, where ,

.

Give an equivalent context-free grammar.

1.1-3 Show that and are regular languages over an arbitrary alphabet .

1.1-4 Give a grammar to generate language , where represents the number of 0's in word and the number of 1's.

1.1-5 Give a grammar to generate all natural numbers.

1.1-6 Give a grammar to generate the following languages, respectively:

,

,

,

.

1.1-7 Let be an extended grammar, where , and contains the productions:

Determine the type of this grammar. Give an equivalent grammar of the same type which is not extended. What language does it generate?