1. fejezet - Elements of Formal Languages

Summary of the chapter: In this chapter, we discuss the basic expressions, notations, definitions and theorems of the scientific field of formal languages and automata theory. In the first part of this chapter, we introduce the alphabet, the word, the language and the operations over them. In the second part, we show general rewriting systems and a way to define algorithms by rewriting systems, namely Markov (normal) algorithms. In the third part, we describe a universal method to define a language by a generative grammar. Finally, in the last part of this chapter we show the Chomsky hierarchy: a classification of generative grammars are based on the form of their production rules, and the classification of formal languages are based on the classification of generative grammars generating them.

1.1. Basic Terms

An alphabet is a finite nonempty set of symbols. We can use the union, the intersection and the relative complement set operations over the alphabet. The absolute complement operation can be used if a universe set is defined.

Example 1. Let the alphabet E be the English alphabet, the alphabet V contains the vowels, and the alphabet C is the set of the consonants. Then VC=E, VC=and

A word is a finite sequence of symbols of the alphabet. The length of the word is the number of symbols it is composed of, with repetitions. Two words are equal if they have the same length and they have the same letter in each position. This might sound trivial, but let us see the following example:

Example 2. Let the alphabet V={1,2,+} and the words p=1+1, q=2. In this case 1+1≠ 2, i.e. pq, because these two words have different lengths, and also different letters in the first position.

There is a special operation on words called concatenation, this is the operation of joining two words end to end.

Example 3. Let the alphabet E be the English alphabet. Let the word p=railway, and the word q=station over the alphabet E. Then, the concatenation of p and q is p· q=railwaystation. The length of p isp∣=7 and the length of q isq∣=7, as well.

If there is no danger of confusion we can omit the dot from between the parameters of the concatenation operation. It is obvious that the equation ∣pq∣=∣p∣+∣q∣ holds for all words p and q. There is a special word called an empty word, whose length is 0 and denoted by λ. The empty word is the identity element of the concatenation operation, λp=pλ=p holds for each word p over any alphabet. The word u is a prefix of the word p if there exists a word w such that p=uw, and the word w is a suffix of the word p if there exists a word u such that p=uw. The word v is a subword of word p if there exists words u,w such that p=uvw. The word u is a proper prefix of the word p, if it is a prefix of p, and the following properties hold: u≠λ and up. The word w is a proper suffix of the word p, if it is a suffix of p, and w≠λ, wp. The word v is a proper subword of the word p, if it is a subword of p, and v≠λ, vp. As you can see, both the prefixes and suffixes are subwords, and both the proper prefixes and proper suffixes are proper subwords, as well.

Example 4. Let the alphabet E be the English alphabet. Let the word p = railwaystation, and the words u=rail, v = way and w = station. In this example, the word u is a prefix, the word w is a suffix and the word v is a subword of the word p. However, the word uv is also a prefix of the word p, and it is a subword of the word p, as well. The word uvw is a suffix of the word p, but it is not a proper suffix.

We can use the exponentiation operation on a word in a classical way, as well. p0 = λ by definition, p1 = p, and pi = pi-1p, for each integer i ≥ 2. The union of pi for each integer i ≥ 0 is denoted by p*, and the union of pi for each integer i ≥ 1 is denoted by p+. These operations are called Kleene star and Kleene plus operations. p*=p+∪{λ} holds for each word p, and if p≠λ then p+=p* \ {λ}. We can also use the Kleene star and Kleene plus operations on an alphabet. For an alphabet V we denote the set of all words over the alphabet by V *, and the set of all words, but the empty word by V +.

A language over an alphabet is not necessarily a finite set of words, and it is usually denoted by L. For a given alphabet V the language L over V is L ⊆ V *. We have a set again, so we can use the classical set operations, union, intersection, and the complement operation, if the operands are defined over the same alphabet. For the absolute complement operation we use V * for universe, so We can also use the concatenation operation. The concatenation of the languages L1 and L2 contains all words pq where pL1 and qL2. The exponentiation operation is defined in a classical way, as well. L0={λ} by definition, L1=L, and Li=Li-1· L, for each integer i ≥ 2. The language L 0 contains exactly one word, the empty word. The empty language does not contain any words, Le=∅, and L 0Le. We can also use the Kleene star and Kleene plus closure for languages. so L*=L + if and only if λ ∈ L.

The algebraic approach to formal languages can be useful for readers who prefer the classical mathematical models. The free monoid on an alphabet V is the monoid whose elements are from V *. From this point of view both operations, concatenation, (also called product) - which is not a commutative operation in this case -, and the union operation (also called addition), create a free monoid on set V, because these operations are associative, and they both have an identity element.

1. Associative:

• (L1 · L2) · L3 = L1 · (L2 · L3),

• (L1L2) ∪ L3 = L1 ∪ (L2L3),

where L1, L2, L3V *.

2. Identity element:

• L0 · L1 = L1 · L0 = L1,

• LeL1 = L1Le = L1,

where L1V *, L 0= {λ}, Le = ∅.

The equation Le · L1 = L1 · Le = Le also holds for each L1V *.