The complete definition of a programming language includes the definition of both its syntax and its semantics.
The syntax of programming languages cannot be fully described by context-free grammars; it can be described using context-dependent grammars, two-level grammars or attribute grammars. Since there are no efficient parsing methods for these grammars, the description of a language consists of two parts. The main part of the syntax is given using context-free grammars, and for the remaining part a context-dependent or an attribute grammar is applied. For example, the description of the program structure or of the statement structure belongs to the first part, while type checking, the scope of variables or the correspondence of formal and actual parameters belong to the second part.
The checking of properties described by context-free grammars is called syntactic analysis or parsing. Properties that cannot be described by context-free grammars form the static semantics. These properties are checked by the semantic analyser.
The conventional semantics is called run-time semantics or dynamic semantics. The dynamic semantics can be given by verbal methods or by interpreter methods, where the operation of the program is given by the sequence of state changes of the interpreter and its environment.
We deal with context-free grammars, and in this section we will use extended grammars for the syntactic analysis. We investigate methods for checking properties which are described by context-free grammars. First we give the basic notions of syntactic analysis, then the parsing algorithms will be studied.
Definition 2.1 Let G = (N, T, P, S) be a grammar. If S ⇒* α and α ∈ (N ∪ T)*, then α is a sentential form. If S ⇒* x and x ∈ T*, then x is a sentence of the language defined by the grammar.
The sentence has an important role in parsing. The program written by a programmer is a series of terminal symbols, and this series is a sentence if it is correct, that is, if it has no syntactic errors.
Definition 2.2 Let G be a grammar and let α = α₁βα₂ be a sentential form (α, α₁, α₂, β ∈ (N ∪ T)*). We say that β is a phrase of α if there is a symbol A ∈ N such that S ⇒* α₁Aα₂ and A ⇒* β. We say that β is a simple phrase of α if A → β ∈ P.
We note that every sentence is a phrase. The leftmost simple phrase has an important role in parsing; it has its own name.
Definition 2.3 The leftmost simple phrase of a sentence is the handle.
The leaves of the syntax tree of a sentence are terminal symbols, the other vertices of the tree are nonterminal symbols, and the root of the tree is the start symbol of the grammar.
In an ambiguous grammar there is at least one sentence, which has several syntax trees. It means that this sentence has more than one analysis, and therefore there are several target programs for this sentence. This ambiguity raises a lot of problems, therefore the compilers translate languages generated by unambiguous grammars only.
We suppose that the grammar has the following properties:
the grammar is cycle-free, that is, it has no derivation of the form A ⇒⁺ A (A ∈ N),
the grammar is reduced, that is, there are no “unused symbols” in the grammar: every nonterminal occurs in some derivation, and from every nonterminal a part of a sentence can be derived. This last property means that for all A ∈ N it is true that S ⇒* αAβ and A ⇒* y, where α, β ∈ (N ∪ T)* and y ∈ T*.
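The reducedness condition above can be checked mechanically. The following sketch, written for a small illustrative grammar of our own choosing (not one from the text), computes the productive and the reachable nonterminals by fixed-point iteration; the grammar is reduced exactly when every nonterminal is in both sets.

```python
# Sketch: checking that a grammar is reduced. The rule set below is a
# made-up example: B is neither productive nor reachable.
# A nonterminal is productive if a terminal string derives from it,
# and reachable if it occurs in some sentential form derived from S.

GRAMMAR = {                      # A -> list of right-hand sides
    "S": [["A", "b"], ["a"]],
    "A": [["a", "A"], ["a"]],
    "B": [["b", "B"]],           # B never derives a terminal string
}
NONTERMINALS = set(GRAMMAR)

def productive(grammar):
    prod = set()
    changed = True
    while changed:
        changed = False
        for left, rights in grammar.items():
            for right in rights:
                if left not in prod and all(
                        x in prod or x not in NONTERMINALS for x in right):
                    prod.add(left)
                    changed = True
    return prod

def reachable(grammar, start):
    reach = {start}
    changed = True
    while changed:
        changed = False
        for left, rights in grammar.items():
            if left in reach:
                for right in rights:
                    for x in right:
                        if x in NONTERMINALS and x not in reach:
                            reach.add(x)
                            changed = True
    return reach

print(sorted(productive(GRAMMAR)))        # ['A', 'S']
print(sorted(reachable(GRAMMAR, "S")))    # ['A', 'S']
```

A reduced version of the grammar is obtained by deleting every rule that mentions a nonterminal outside the intersection of the two sets.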
As we have seen, the lexical analyser translates the program written by a programmer into a series of terminal symbols, and this series is the input of the syntactic analyser. The task of the syntactic analyser is to decide whether this series is a sentence of the grammar or not. To achieve this goal, the parser creates the syntax tree of the series of symbols. From the known start symbol and the leaves of the syntax tree, the parser creates all vertices and edges of the tree, that is, it creates a derivation of the program. If this is possible, then we say that the program is an element of the language, which means that the program is syntactically correct.
Henceforward we will deal with left-to-right parsing methods. These methods read the symbols of the program from left to right. All real compilers use this method.
There are several methods for creating the inner part of the syntax tree. One of these methods builds the syntax tree starting from its start symbol S; this is called the top-down method. If the parser goes from the leaves towards the symbol S, then it uses a bottom-up parsing method.
We deal with top-down parsing methods in Subsection 2.3.1. We investigate bottom-up parsers in Subsection 2.3.2; the latter methods are used in today's real compilers.
If we analyse top-down, then we start with the start symbol. This symbol is the root of the syntax tree, and we attempt to construct the tree so that its leaves are the terminal symbols.
First we review the notions that are necessary for top-down parsing. Then the table methods and the recursive-descent method will be analysed.
Our methods build the syntax tree top-down and read the symbols of the program left to right. To this end, we try to create terminals on the left side of the sentential forms.
Definition 2.4 If A → γ ∈ P, then the leftmost direct derivation of the sentential form xAβ (x ∈ T*, β ∈ (N ∪ T)*) is xγβ, and we write xAβ ⇒_leftmost xγβ.
Definition 2.5 If all direct derivations in S ⇒* x (x ∈ T*) are leftmost, then this derivation is said to be a leftmost derivation, and we write S ⇒*_leftmost x.
In a leftmost derivation, terminal symbols appear on the left side of the sentential forms. Therefore we use leftmost derivations in all top-down parsing methods. Hence, when we deal with top-down methods, we do not write the text “leftmost” at the arrows.
One might as well say that we create all possible syntax trees. Reading their leaves from left to right, we obtain the sentences of the language. Then we compare these sentences with the parseable text, and if a sentence is the same as the parseable text, then we can read the steps of parsing from the syntax tree which belongs to this sentence. But this method is not practical; generally it is even impossible to apply.
A better idea is the following. We start at the start symbol of the grammar, and using leftmost derivations we try to create the text of the program. If we use an unsuitable derivation at some step of the parsing, then we find that, at a later step, we cannot apply a proper derivation: terminal symbols appear on the left side of the sentential form that differ from those in our parseable text.
For the leftmost terminal symbols we state the following theorem: the leftmost terminal symbols of a sentential form cannot be changed by applying the derivation rules of a context-free grammar. The proof of this theorem is trivial.
This theorem is used during the building of the syntax tree to check that the leftmost terminals of the tree are the same as the leftmost symbols of the parseable text. If they are different, then we have taken a wrong direction with this syntax tree. In this case we have to make a backtrack and apply another derivation rule. If this is impossible (since, for example, there are no more derivation rules), then we have to backtrack once again.
General top-down methods are realized by using backtrack algorithms, but these backtrack steps make the parser very slow. Therefore we will deal only with grammars that have parsing methods without backtracks.
The main property of LL(k) grammars is the following. If, by creating the leftmost derivation S ⇒* wx (w, x ∈ T*), we obtain the sentential form wAβ (A ∈ N, β ∈ (N ∪ T)*) at some step of this derivation, and our goal is to achieve wx, then the next step of the derivation for the nonterminal A is determinable unambiguously from the first k symbols of x.
To look ahead k symbols we define the function First_k.
Definition 2.7 Let First_k(α) (k ≥ 0, α ∈ (N ∪ T)*) be the set as follows:
First_k(α) = { x | α ⇒* xβ and |x| = k } ∪ { x | α ⇒* x and |x| < k }, where x ∈ T* and β ∈ (N ∪ T)*.
The set First_k(α) consists of the k-length terminal prefixes of the strings derivable from α; for a derivable terminal string x with |x| < k, it contains the full x. If α ⇒* ε, then ε ∈ First_k(α).
Definition 2.8 The grammar G is an LL(k) grammar (k ≥ 1) if, for the leftmost derivations
S ⇒* wAβ ⇒ wα₁β ⇒* wx and S ⇒* wAβ ⇒ wα₂β ⇒* wy
(w, x, y ∈ T*; α₁, α₂, β ∈ (N ∪ T)*), the equality
First_k(x) = First_k(y)
implies
α₁ = α₂.
Using this definition, if a grammar is an LL(k) grammar, then the k symbols following the already parsed text w determine the next derivation rule unambiguously (Figure 2.8).
One can see from this definition that if a grammar is an LL(k) grammar, then for all k′ > k it is also an LL(k′) grammar. When we speak about an LL(k) grammar, we also mean that k is the least number such that the properties of the definition hold.
Example 2.5 The next grammar is an LL(1) grammar. Let G be a grammar whose derivation rules are:
We have to use the derivation for the start symbol if the next symbol of the parseable text is or . We use the derivation if the next symbol is the mark #.
Example 2.6 The next grammar is an LL(2) grammar. Let G be a grammar whose derivation rules are:
One can see that at the last step of derivations
and
if we look ahead one symbol, then in both derivations we obtain the same symbol. The proper rule can therefore be determined only by looking ahead two symbols.
There are context-free grammars that are not LL(k) grammars for any k. For example, the next grammar is not an LL(k) grammar for any k.
Example 2.7 Let G be a grammar whose derivation rules are:
The language consists of the sentences given above. If we analyse such a sentence, then at the first step we cannot decide by looking ahead k symbols which of the two derivations we have to use, since the two lookahead sets coincide for all k.
By the definition of the LL(k) grammar, if we obtain the sentential form wAβ using leftmost derivations, then the next k symbols determine the next rule for the symbol A. This is stated in the next theorem.
Theorem 2.9 The grammar G is an LL(k) grammar iff
S ⇒*_leftmost wAβ, together with the rules A → α₁ and A → α₂ (α₁ ≠ α₂),
implies
First_k(α₁β) ∩ First_k(α₂β) = ∅.
If there is a rule A → α in the grammar, then the set First_k(αβ) consists of the k-length prefixes of the terminal strings generated from αβ. This implies that, for deciding the LL(k) property, we have to check not only the derivation rules but also the infinite derivations.
We can give methods that are good in practice for LL(1) grammars only. We now define the Follow sets, which collect the terminal symbols that may follow a symbol or a string of symbols.
Definition 2.10 Follow_k(β) = { x | S ⇒* αβγ and x ∈ First_k(γ) } (α, β, γ ∈ (N ∪ T)*), and if S ⇒* αβ, that is γ = ε, then # ∈ Follow_k(β).
The second part of the definition is necessary because if there are no symbols after β in the derivation, that is γ = ε, then the next symbol after β is the mark # only.
Follow(A) (A ∈ N) consists of the terminal symbols that can stand immediately after the symbol A in a sentential form derivable from S.
Theorem 2.11 The grammar G is an LL(1) grammar iff, for all nonterminals A and for all derivation rules A → α₁ and A → α₂ (α₁ ≠ α₂),
First₁(α₁Follow₁(A)) ∩ First₁(α₂Follow₁(A)) = ∅.
In this theorem the expression First(αFollow(A)) means that we have to concatenate the elements of the set Follow(A) to α separately, and we have to apply the function First to all elements of this new set.
It is evident that Theorem 2.11 is suitable for deciding whether a grammar is LL(1) or not.
Henceforward we deal with languages determined by LL(1) grammars, and we investigate the parsing methods of LL(1) languages. For the sake of simplicity, we omit the index 1 from the names of the functions First and Follow.
The elements of the set First(α) are determined using the next algorithm.
First(α)
 1  IF α = ε
 2    THEN F ← {ε}
 3  IF α = a, where a ∈ T
 4    THEN F ← {a}
 5  IF α = A, where A ∈ N
 6    THEN IF A → ε ∈ P
 7           THEN F ← {ε}
 8           ELSE F ← ∅
 9         FOR all A → Y₁Y₂…Yₘ ∈ P
10           DO F ← F ∪ (First(Y₁) \ {ε})
11              FOR j ← 1 TO m − 1
12                DO IF Y₁Y₂…Y_j ⇒* ε
13                     THEN F ← F ∪ (First(Y_{j+1}) \ {ε})
14              IF Y₁Y₂…Yₘ ⇒* ε
15                THEN F ← F ∪ {ε}
16  IF α = X₁X₂…Xₙ (n ≥ 2)
17    THEN F ← First(X₁) \ {ε}
18         FOR j ← 1 TO n − 1
19           DO IF X₁X₂…X_j ⇒* ε
20                THEN F ← F ∪ (First(X_{j+1}) \ {ε})
21         IF X₁X₂…Xₙ ⇒* ε
22           THEN F ← F ∪ {ε}
23  RETURN F
In lines 1–4 the set F is given for α = ε and for a terminal symbol a. In lines 5–15 we construct the elements of this set for a nonterminal A. If ε is derivable from A, then we put the symbol ε into the set in lines 6–7 and 14–15. If the argument is a string of symbols, then the elements of the set are constructed in lines 16–22. We note that we can terminate the FOR cycles in lines 11 and 18 as soon as ε cannot be derived from the prefix examined, since in this case the remaining symbols contribute nothing to the set.
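The algorithm above computes First(α) recursively; an equally common formulation computes the First sets of all nonterminals at once by a fixed-point iteration, which also avoids difficulties with left-recursive rules. A minimal Python sketch, for a hypothetical grammar of our own choosing, where the empty string "" plays the role of ε:

```python
# Fixed-point computation of the First sets of all nonterminals.
# The grammar S -> AB, A -> aA | eps, B -> b is an illustrative example.

GRAMMAR = {
    "S": [["A", "B"]],
    "A": [["a", "A"], []],     # A -> aA | eps
    "B": [["b"]],
}
NONTERMS = set(GRAMMAR)

def first_sets(grammar):
    first = {A: set() for A in grammar}
    changed = True
    while changed:
        changed = False
        for A, rights in grammar.items():
            for right in rights:
                # First of the string `right` under the current approximation
                out, nullable = set(), True
                for X in right:
                    if X in NONTERMS:
                        out |= first[X] - {""}
                        if "" not in first[X]:
                            nullable = False
                            break
                    else:               # terminal: it decides, stop here
                        out.add(X)
                        nullable = False
                        break
                if nullable:            # every symbol of `right` derives eps
                    out.add("")
                if not out <= first[A]:
                    first[A] |= out
                    changed = True
    return first

f = first_sets(GRAMMAR)
print(sorted(f["A"]))   # ['', 'a']   -- A is nullable
print(sorted(f["S"]))   # ['a', 'b']  -- since A derives eps, First(B) counts
```

The inner loop over `right` is exactly the case of lines 16–22 of the pseudocode, applied repeatedly until nothing changes.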
In Theorem 2.11 and hereafter, it is necessary to know the elements of the set Follow(A). The next algorithm constructs this set.
Follow(A)
 1  IF A = S
 2    THEN F ← {#}
 3    ELSE F ← ∅
 4  FOR all rules B → αAβ ∈ P
 5    DO IF |β| > 0
 6         THEN F ← F ∪ (First(β) \ {ε})
 7              IF ε ∈ First(β)
 8                THEN F ← F ∪ Follow(B)
 9         ELSE F ← F ∪ Follow(B)
10  RETURN F
The elements of the set First(β) get into the set Follow(A). In lines 4–9 we check, if the argument A appears on the right side of a derivation rule, what symbols may stand immediately after it. It is obvious that no ε is in this set, and that the symbol # is in the set only if the argument is the rightmost symbol of a sentential form.
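As with First, the Follow sets can be computed by a fixed-point iteration instead of recursion. The sketch below uses a hypothetical grammar S → Ab, A → aA | ε of our own; its First sets are supplied as precomputed data, "" again stands for ε, and "#" is the end mark.

```python
# Fixed-point computation of the Follow sets.

GRAMMAR = {
    "S": [["A", "b"]],
    "A": [["a", "A"], []],      # A -> aA | eps
}
NONTERMS = set(GRAMMAR)
FIRST = {"S": {"a", "b"}, "A": {"a", ""}}   # precomputed for this grammar

def first_of_string(symbols):
    """First of a string of grammar symbols, using the FIRST table."""
    out = set()
    for X in symbols:
        xs = FIRST[X] if X in NONTERMS else {X}
        out |= xs - {""}
        if "" not in xs:
            return out
    out.add("")                 # the whole string derives eps
    return out

def follow_sets(grammar, start):
    follow = {A: set() for A in grammar}
    follow[start].add("#")      # the end mark follows the start symbol
    changed = True
    while changed:
        changed = False
        for B, rights in grammar.items():
            for right in rights:
                for i, X in enumerate(right):
                    if X not in NONTERMS:
                        continue
                    rest = first_of_string(right[i + 1:])
                    new = (rest - {""}) | (follow[B] if "" in rest else set())
                    if not new <= follow[X]:
                        follow[X] |= new
                        changed = True
    return follow

print(follow_sets(GRAMMAR, "S"))   # Follow(S) = {'#'}, Follow(A) = {'b'}
```

The two branches of the pseudocode (lines 5–9) appear here as the single expression for `new`: the First of the suffix, plus Follow(B) when that suffix can vanish.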
Suppose that we analyse a series of terminal symbols and that its prefix x has already been analysed without errors. We analyse the text with a top-down method, so we use leftmost derivations. Suppose that our sentential form has the form xAβ or xaβ (A ∈ N, a ∈ T, β ∈ (N ∪ T)*) (Figure 2.9).
In the first case the next step is the substitution of the symbol A. We know the next element of the input series, the terminal symbol a, therefore we can determine the correct substitution of the symbol A. This substitution is the rule A → α for which a ∈ First(αFollow(A)). If there is such a rule then, according to the definition of the LL(1) grammar, there is exactly one. If there is no such rule, then a syntactic error was found.
In the second case the next symbol of the sentential form is the terminal symbol a, thus we expect the symbol a as the next symbol of the analysed text. If this comes true, then this symbol is correct and we can go further: we put the symbol into the already analysed text. Otherwise there is a syntactic error. We can see that the position of the error is known, and that the erroneous symbol is the actual terminal symbol.
The action of the parser is the following. Let # be the sign of the right end of the analysed text, that is, the mark # is the last symbol of the text. We use a stack during the analysis; the bottom of the stack is also marked by #. We give serial numbers to the derivation rules, and during the analysis we write the number of each applied rule into a list. At the end of parsing we can construct the syntax tree from this list (Figure 2.10).
We denote the state of the parser by the triple (ay#, Xα#, v). The first component, ay#, is the text not analysed yet. The second component, Xα#, is the part of the sentential form corresponding to the not analysed text; this information is in the stack, and the symbol X is at the top of the stack. The third component, v, is the list of the serial numbers of the applied production rules.
When we analyse the text, we observe the symbol at the top of the stack and the first symbol of the not analysed text. The latter is called the actual symbol. There are pointers to the top of the stack and to the actual symbol.
We use a top-down parser, therefore the initial content of the stack is S#. If the text to be analysed is x, then the initial state of the parsing process is the triple (x#, S#, ε), where ε denotes the empty list.
We analyse the text, the series of symbols, using a parsing table. The rows of this table are labelled by the symbols that may occur at the top of the stack, the columns by the possible next input symbols, and we write the mark # into the last row and the last column of the table. Hence the number of rows of the table is greater by one than the number of symbols of the grammar, and the number of columns is greater by one than the number of terminal symbols of the grammar.
The element T[X, a] of the table is as follows:
T[X, a] = (α, i), if X → α is the i-th derivation rule and a ∈ First(αFollow(X)),
T[X, a] = pop, if X = a (a ∈ T),
T[X, a] = accept, if X = # and a = #,
T[X, a] = error otherwise.
We fill in the parsing table using the following algorithm.
LL(1)-Table-Fill-in(G)
 1  FOR all A ∈ N
 2    DO IF A → α is the i-th derivation rule
 3         THEN FOR all a ∈ (First(α) \ {ε})
 4                DO T[A, a] ← (α, i)
 5              IF ε ∈ First(α)
 6                THEN FOR all a ∈ Follow(A)
 7                       DO T[A, a] ← (α, i)
 8  FOR all a ∈ T
 9    DO T[a, a] ← pop
10  T[#, #] ← accept
11  FOR all X ∈ N ∪ T ∪ {#} and all a ∈ T ∪ {#}
12    DO IF T[X, a] = “empty”
13         THEN T[X, a] ← error
14  RETURN T
In line 10 we write the text accept into the right lower corner of the table. In lines 8–9 we write the text pop into the main diagonal of the square labelled by terminal symbols. The program in lines 1–7 writes a tuple in which the first element is the right part of a derivation rule and the second element is the serial number of this rule. In lines 12–13 we write error texts into the empty positions.
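The table fill-in can be sketched in Python. The grammar below (S → Ab as rule 1, A → aA as rule 2, A → ε as rule 3) and its First(αFollow(A)) sets are illustrative assumptions, not taken from the text; missing dictionary entries play the role of error.

```python
# Sketch of the LL(1) parsing-table fill-in for a hypothetical grammar.

RULES = [("S", ["A", "b"]), ("A", ["a", "A"]), ("A", [])]   # numbered 1..3
TERMINALS = ["a", "b"]
FIRST_ALPHA_FOLLOW = {      # First(alpha Follow(A)) per rule, precomputed
    1: {"a", "b"},          # First(Ab) = {a, b}
    2: {"a"},               # First(aA Follow(A)) = {a}
    3: {"b"},               # A -> eps applies on Follow(A) = {b}
}

def fill_table():
    table = {}
    for i, (left, right) in enumerate(RULES, start=1):
        for a in FIRST_ALPHA_FOLLOW[i]:
            table[(left, a)] = (right, i)        # expand entries (lines 1-7)
    for a in TERMINALS:
        table[(a, a)] = "pop"                    # main diagonal (lines 8-9)
    table[("#", "#")] = "accept"                 # right lower corner (line 10)
    return table                                 # missing entries mean error

T = fill_table()
print(T[("S", "a")])   # (['A', 'b'], 1)
print(T[("A", "b")])   # ([], 3)
print(T[("b", "b")])   # pop
```

If the loop in lines 1–7 ever tried to overwrite an already filled entry, the grammar would not be LL(1); a real implementation would report a conflict at that point.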
The actions of the parser are described by state transitions. The initial state is (x#, S#, ε), where x is the text to be analysed, and the parsing process is finished when the parser goes into the final state (#, #, v). If the text is ay# in an intermediate step, and the symbol X is at the top of the stack, then the possible state transitions are as follows:
(ay#, Xα#, v) → (ay#, βα#, vi), if T[X, a] = (β, i),
(ay#, aα#, v) → (y#, α#, v), if T[a, a] = pop,
(#, #, v) → O.K., if T[#, #] = accept,
(ay#, Xα#, v) → ERROR otherwise.
The letters O.K. mean that the analysed text is syntactically correct; the text ERROR means that a syntactic error is detected.
The actions of this parser are written by the next algorithm.
LL(1)-Parser(xay#, T)
 1  s ← analyse, let a be the first input symbol and X the symbol at the top of the stack
 2  REPEAT
 3    IF X ∈ N and T[X, a] = (β, i)
 4      THEN replace X at the top of the stack by β, and append i to the list v
 5    ELSE IF T[X, a] = pop
 6      THEN pop X from the stack and read the next input symbol into a
 7    ELSE IF T[X, a] = accept
 8      THEN s ← O.K.
 9    ELSE s ← ERROR
10  UNTIL s = O.K. OR s = ERROR
11  RETURN s, together with the triple describing the final state
The input parameters of this algorithm are the text xay# and the parsing table T. The variable s describes the state of the parser: its value is analyse during the analysis, and it is either O.K. or ERROR at the end. The parser determines its action from the actual symbol and from the symbol at the top of the stack, using the parsing table T. In lines 3–4 the parser builds the syntax tree using the derivation rule A → β. In lines 5–6 the parser executes a shift action, since there is a terminal symbol at the top of the stack. In lines 7–8 the algorithm finishes its work if the stack is empty and the whole text has been read; otherwise a syntactic error was detected. At the end of this work the result is O.K. or ERROR in the variable s, and, as a result, there is a triple at the output of this algorithm. If the text was correct, then we can create the syntax tree of the analysed text from the third element of the triple. If there was an error, then the first element of the triple points to the position of the erroneous symbol.
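The state-transition loop can be sketched as a short Python driver. The parsing table below is the one obtained for the hypothetical grammar S → Ab (rule 1), A → aA (rule 2), A → ε (rule 3); the triple of the text corresponds to (inp, stack, rules).

```python
# Sketch of the table-driven LL(1) parser; "#" marks both the end of
# the input and the bottom of the stack.

TABLE = {
    ("S", "a"): (["A", "b"], 1), ("S", "b"): (["A", "b"], 1),
    ("A", "a"): (["a", "A"], 2), ("A", "b"): ([], 3),
    ("a", "a"): "pop", ("b", "b"): "pop", ("#", "#"): "accept",
}

def parse(text):
    inp = list(text) + ["#"]
    stack = ["#", "S"]            # top of the stack is the end of the list
    rules = []                    # serial numbers of the applied rules
    while True:
        X, a = stack[-1], inp[0]
        action = TABLE.get((X, a), "error")
        if action == "accept":
            return ("O.K.", rules)
        if action == "error":
            return ("ERROR", rules)
        if action == "pop":       # shift: terminal on top matches the input
            stack.pop()
            inp.pop(0)
        else:                     # expand: replace X by the right-hand side
            right, i = action
            stack.pop()
            stack.extend(reversed(right))
            rules.append(i)

print(parse("aab"))   # ('O.K.', [1, 2, 2, 3])
print(parse("ba"))    # ('ERROR', [1, 3])
```

The returned rule list is a left parse: replaying rules 1, 2, 2, 3 as leftmost derivation steps rebuilds the syntax tree of aab.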
Example 2.8 Let G be a grammar whose derivation rules are:
(1) E → TE'  (2) E' → +TE'  (3) E' → ε  (4) T → FT'  (5) T' → *FT'  (6) T' → ε  (7) F → (E)  (8) F → i
From these rules we can determine the First and Follow sets. To fill in the parsing table, the following sets are required:
First(TE') = {(, i},
First(+TE') = {+},
Follow(E') = {), #},
First(FT') = {(, i},
First(*FT') = {*},
Follow(T') = {+, ), #},
First((E)) = {(},
First(i) = {i}.
The parsing table is as follows; the empty positions in the table mean errors.
Example 2.9 Using the parsing table of the previous example, analyse the text .
The syntax tree of the analysed text is shown in Figure 2.11.
There is another frequently used method for backtrack-free top-down parsing. Its essence is that we write a real program for the given grammar: we create a procedure for each symbol of the grammar, and the recursive calls of these procedures realize the stack of the parser and its management. This is a top-down method in which the procedures call each other recursively; this is the origin of the name of the method, the recursive-descent method.
To check the terminal symbols we create the procedure Check. Let the parameter of this procedure be the expected symbol, that is, the leftmost unchecked terminal symbol of the sentential form, and let the actual symbol be the symbol which is being analysed at that moment.
procedure Check(a);
begin
  if actual_symbol = a
    then Next_symbol
    else Error_report
end;
The procedure Next_symbol reads the next symbol; it is a call to the lexical analyser. This procedure determines the next symbol and puts this symbol into the actual_symbol variable. The procedure Error_report creates an error report and then finishes the parsing.
We create procedures for the symbols of the grammar as follows. The procedure of the nonterminal symbol A is the following:
procedure A;
begin
  T(A)
end;
where T(A) is determined by the symbols on the right side of the derivation rule having the symbol A on its left side.
The grammars which are used for syntactic analysis are reduced grammars. This means that there are no unnecessary symbols in the grammar, and every nonterminal occurs on the left side of at least one derivation rule. Therefore, if we consider the symbol A, there is at least one A → α production rule.
If there is only one production rule for the symbol A:
for the rule A → a, let the program of the rule be the call Check(a);
for the rule A → B, we give the procedure call B;
for the rule A → X₁X₂…Xₙ, we give the next block:
begin
T(X_1);
T(X_2);
...
T(X_n)
end;
If there are several production rules for the symbol A:
If the rules A → α₁ | α₂ | … | αₙ are ε-free, that is, it is not possible to derive ε from any αᵢ (1 ≤ i ≤ n), then T(A) is the case statement:
case actual_symbol of
First(alpha_1) : T(alpha_1);
First(alpha_2) : T(alpha_2);
...
First(alpha_n) : T(alpha_n)
end;
where First(alpha_i) denotes the set First(αᵢ).
We note that this is the first point of the recursive-descent method where we use the fact that the grammar is an LL(1) grammar.
We use grammars to define programming languages, and it is not convenient to require that such a grammar be ε-free. For the rules A → α₁ | α₂ | … | αₙ₋₁ | ε we create the next T(A) program:
case actual_symbol of
First(alpha_1) : T(alpha_1);
First(alpha_2) : T(alpha_2);
...
First(alpha_(n1)) : T(alpha_(n1));
Follow(A) : skip
end;
where Follow(A) denotes the set Follow(A).
In particular, if αᵢ = ε for some i, that is, there is a rule A → ε, then the i-th row of the case statement is
Follow(A) : skip
In the program T(A), where it is possible, we use an if-then-else or a while statement instead of the case statement.
The start procedure, that is, the main program of this parsing program, is the procedure created for the start symbol of the grammar.
We can create the recursive-descent parsing program with the next algorithm. The input of this algorithm is the grammar G, and the result is the parsing program P. In this algorithm we use a WRITE-PROGRAM procedure, which concatenates the new program lines to the program P. We will not go into the details of this algorithm.
Create-Rec-Desc(G)
 1  P ← the empty program
 2  WRITE-PROGRAM(
 3    procedure Check(a);
 4    begin
 5      if actual_symbol = a
 6        then Next_symbol
 7        else Error_report
 8    end;
 9  )
10  FOR all symbols A of the grammar G
11    DO IF A = S
12         THEN WRITE-PROGRAM(
13                program S;
14                begin
15                  REC-DESC-STAT(S, P)
16                end;
17              )
18         ELSE WRITE-PROGRAM(
19                procedure A;
20                begin
21                  REC-DESC-STAT(A, P)
22                end;
23              )
24  RETURN P
The algorithm creates the procedure Check in lines 2–9. Then, for all nonterminals of the grammar G, it determines their procedures using the algorithm REC-DESC-STAT. In lines 11–17 we can see that for the start symbol we create the main program. The output of the algorithm is the parsing program P.
Rec-Desc-Stat(A, P)
 1  IF there is only one rule A → α
 2    THEN REC-DESC-STAT1(α, P)
 3    ELSE REC-DESC-STAT2(A, (α₁, …, αₙ), P), where A → α₁ | … | αₙ are the rules of A
 4  RETURN P
The form of the statements of the parsing program depends on the derivation rules of the symbol A. Therefore the algorithm REC-DESC-STAT divides the task into two parts. The algorithm REC-DESC-STAT1 deals with the case when there is only one derivation rule, and the algorithm REC-DESC-STAT2 creates the program for the alternatives.
Rec-Desc-Stat1(α, P)
 1  IF α = a, where a ∈ T
 2    THEN WRITE-PROGRAM(
 3           Check(a)
 4         )
 5  IF α = B, where B ∈ N
 6    THEN WRITE-PROGRAM(
 7           B
 8         )
 9  IF α = X₁X₂…Xₙ (n ≥ 2)
10    THEN WRITE-PROGRAM(
11           begin
12             REC-DESC-STAT1(X₁, P);
13             REC-DESC-STAT1(X₂, P);
14             …
15             REC-DESC-STAT1(Xₙ, P)
16           end; )
17  RETURN P
Rec-Desc-Stat2(A, (α₁, …, αₙ), P)
 1  IF the rules A → α₁ | … | αₙ are ε-free
 2    THEN WRITE-PROGRAM(
 3           case actual_symbol of
 4             First(alpha_1) : REC-DESC-STAT1(α₁, P);
 5             ...
 6             First(alpha_n) : REC-DESC-STAT1(αₙ, P)
 7           end;
 8         )
 9  IF there is a rule A → αᵢ with αᵢ = ε
10    THEN WRITE-PROGRAM(
11           case actual_symbol of
12             First(alpha_1) : REC-DESC-STAT1(α₁, P);
13             ...
14             First(alpha_(i-1)) : REC-DESC-STAT1(αᵢ₋₁, P);
15             Follow(A) : skip;
16             First(alpha_(i+1)) : REC-DESC-STAT1(αᵢ₊₁, P);
17             ...
18             First(alpha_n) : REC-DESC-STAT1(αₙ, P)
19           end;
20         )
21  RETURN P
These two algorithms create the program described above.
Checking the end of the parsed text is achieved by the recursive-descent method with the next modification. We generate a new derivation rule for the end mark #. If the start symbol of the grammar is S, then we create the new rule S' → S#, where the new symbol S' is the start symbol of our new grammar. The mark # is considered a terminal symbol. Then we generate the parsing program for this new grammar.
Example 2.10 We augment the grammar of Example 2.8 in the above manner. The production rules are as follows:
S' → E#  E → TE'  E' → +TE' | ε  T → FT'  T' → *FT' | ε  F → (E) | i
In Example 2.8 we gave the necessary First and Follow sets. We use the following sets:
First(+TE') = {+},
Follow(E') = {), #},
First(*FT') = {*},
Follow(T') = {+, ), #},
First((E)) = {(},
First(i) = {i}.
In the comments of the program lines we indicate which of these sets is used. The first characters of a comment are the character pair --.
The program of the recursivedescent parser is the following.
program S';
begin
  E; Check(#)
end.

procedure E;
begin
  T; E'
end;

procedure E';
begin
  case actual_symbol of
    + : begin          -- First(+TE')
          Check(+); T; E'
        end;
    ),# : skip         -- Follow(E')
  end
end;

procedure T;
begin
  F; T'
end;

procedure T';
begin
  case actual_symbol of
    * : begin          -- First(*FT')
          Check(*); F; T'
        end;
    +,),# : skip       -- Follow(T')
  end
end;

procedure F;
begin
  case actual_symbol of
    ( : begin          -- First((E))
          Check((); E; Check())
        end;
    i : Check(i)       -- First(i)
  end
end;
We can see that the main program of this parser belongs to the symbol S'.
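The generated Pascal parser can be transcribed almost mechanically into other languages. Below is a sketch of a Python counterpart; the class, its method names and the one-character token convention (i stands for an identifier, # for the end mark) are our own choices, and the skip branches are permissive: a symbol outside the expected sets is caught by the next Check call rather than in the case statement itself.

```python
# A Python transcription of the generated recursive-descent parser.
# The lexical analyser is simulated by stepping through a string of
# one-character tokens.

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, text):
        self.text = text + "#"
        self.pos = 0

    @property
    def actual_symbol(self):
        return self.text[self.pos]

    def check(self, a):                  # procedure Check(a)
        if self.actual_symbol == a:
            self.pos += 1                # Next_symbol
        else:
            raise ParseError(f"error at position {self.pos}")

    def S(self):                         # program S'
        self.E(); self.check("#")

    def E(self):                         # E -> TE'
        self.T(); self.E_()

    def E_(self):                        # E' -> +TE' | eps
        if self.actual_symbol == "+":    # First(+TE')
            self.check("+"); self.T(); self.E_()
        # on ")" or "#" (Follow(E')) skip

    def T(self):                         # T -> FT'
        self.F(); self.T_()

    def T_(self):                        # T' -> *FT' | eps
        if self.actual_symbol == "*":    # First(*FT')
            self.check("*"); self.F(); self.T_()

    def F(self):                         # F -> (E) | i
        if self.actual_symbol == "(":    # First((E))
            self.check("("); self.E(); self.check(")")
        else:
            self.check("i")              # First(i)

def accepts(text):
    try:
        Parser(text).S()
        return True
    except ParseError:
        return False

print(accepts("i+i*i"))     # True
print(accepts("(i+i)*i"))   # True
print(accepts("i+*i"))      # False
```

As in the Pascal version, the call graph of the methods mirrors the derivation rules, so the runtime call stack plays the role of the parser stack.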
If we analyse bottom-up, then we start with the program text. We search for the handle of the sentential form, and we substitute the nonterminal symbol that belongs to the handle for this handle. After this first step, we repeat this procedure several times. Our goal is to achieve the start symbol of the grammar. This symbol will be the root of the syntax tree, and by this time the terminal symbols of the program text are the leaves of the tree.
First we review the notions which are necessary in the parsing.
To analyse bottom-up, we have to determine the handle of the sentential form. The problem is to create a good method which finds the handle, and to find the best substitution if there is more than one possibility.
Definition 2.12 If A → β ∈ P, then the rightmost substitution of the sentential form αAx (x ∈ T*, α ∈ (N ∪ T)*) is αβx, that is, αAx ⇒_rightmost αβx.
Definition 2.13 If all substitutions in the derivation S ⇒* x (x ∈ T*) were rightmost substitutions, then this derivation is a rightmost derivation, S ⇒*_rightmost x.
In a rightmost derivation, terminal symbols are on the right side of the sentential forms. By the connection between the notion of the handle and the rightmost derivation, if we apply the steps of a rightmost derivation backwards, then we obtain the steps of a bottom-up parsing. Hence bottom-up parsing is equivalent to the “inverse” of a rightmost derivation. Therefore, when we deal with bottom-up methods, we will not write the text “rightmost” at the arrows.
General bottom-up parsing methods are realized by using backtrack algorithms. They are similar to the top-down parsing methods, but the backtrack steps make the parser very slow. Therefore we deal only with grammars that have parsing methods without backtracks.
Henceforward we present a very efficient algorithm for a large class of context-free grammars. This class contains the grammars of the programming languages.
The parsing is called LR(k) parsing and the grammar is called an LR(k) grammar. LR means the “Left to Right” method, and k means that if we look ahead k symbols then we can determine the handles of the sentential forms. The LR(k) parsing method is a shift-reduce method.
We deal with LR(1) parsing only, since for every LR(k) grammar there is an equivalent LR(1) grammar. This fact is very important for us since, using this type of grammars, it is enough to look ahead one symbol in all cases.
Creating LR(k) parsers is not an easy task. However, there are standard programs (for example yacc in UNIX systems) that create the complete parsing program from the derivation rules of a grammar. Using these programs, the task of writing parsers is not too hard.
After studying LR(k) grammars we will deal with the LR(1) parsing method, which is used in the compilers of modern programming languages.
As we did previously, we write a mark # to the right end of the text to be analysed. We introduce a new nonterminal symbol S' and a new rule S' → S into the grammar.
Definition 2.14 Let G' be the augmented grammar belonging to the grammar G = (N, T, P, S), where G' = (N ∪ {S'}, T, P ∪ {S' → S}, S').
We assign serial numbers to the derivation rules of the grammar, and let S' → S be the 0th rule. Using this numbering, if we apply the 0th rule, it means that the parsing process is concluded and the text is correct.
We notice that if the original start symbol does not occur on the right side of any rule, then there is no need for this augmentation. However, for the sake of generality, we deal with augmented grammars only.
Definition 2.15 The augmented grammar G' is an LR(k) grammar (k ≥ 0) if, for the rightmost derivations
S' ⇒* αAw ⇒ αβw and S' ⇒* γBx ⇒ αβy
(α, β, γ ∈ (N ∪ T)*; w, x, y ∈ T*), the equality
First_k(w) = First_k(y)
implies
α = γ, A = B and x = y.
The feature of LR(k) grammars is that, in the sentential form αβw, looking ahead k symbols from w unambiguously decides whether β is the handle. If the handle is β, then we have to reduce the form using the rule A → β, which results in the new sentential form αAw. The reason is the following: suppose that, for the sentential forms αβw and αβy (their prefixes αβ are the same), First_k(w) = First_k(y), and that we can reduce αβw to αAw and αβy to γBy. In this case, since the grammar is an LR(k) grammar, α = γ and A = B hold. Therefore either β is the handle of both sentential forms or it is the handle of neither of them.
Example 2.11 Let G be a grammar and let the derivation rules be as follows.
This grammar is not an grammar, since using notations of the definition, in the derivations
it holds that , and .
Example 2.12 The next grammar is an LR(1) grammar; its derivation rules are:
.
In the next example we show that there is a context-free grammar that is not an LR(k) grammar for any k.
Example 2.13 Let G be a grammar and let the derivation rules be
Now for all
and
but
It is not certain that, for an LL(k) grammar, we can find an equivalent LL(1) grammar. However, LR(k) grammars have this nice property.
Theorem 2.16 For every LR(k) grammar there is an equivalent LR(1) grammar.
The great significance of this theorem is that it makes it sufficient to study LR(1) grammars instead of LR(k) grammars.
Now we define a very important notion of LR parsing.
Definition 2.17 If β is the handle of the sentential form αβx (α ∈ (N ∪ T)*, x ∈ T*), then the prefixes of αβ are the viable prefixes of αβx.
Example 2.14 Let G be a grammar with the derivation rules as follows.
is a sentential form, and the first is the handle. The viable prefixes of this sentential form are .
By the above definition, the symbols after the handle are not part of any viable prefix. Hence the task of finding the handle is the task of finding the longest viable prefix.
For a given grammar, the set of viable prefixes is determined, but it is obvious that the size of this set is not always finite.
The significance of viable prefixes is the following. We can assign the states of a deterministic finite automaton to viable prefixes, and we can assign state transitions to the symbols of the grammar. From the initial state we go to a state along the symbols of a viable prefix. Using this property, we will give a method for creating an automaton that executes the task of parsing.
Definition 2.18 If A → αβ is a rule of a grammar, then let
[A → α.β, a] (a ∈ T ∪ {#})
be an LR(1) item, where A → α.β is the core of the item and a is the lookahead symbol of the item.
The lookahead symbol is instrumental only in reductions, i.e. when the item has the form [A → α., a]. It means that we can execute the reduction only if the symbol a follows the handle α.
Definition 2.19 The item [A → α.β, a] is valid for the viable prefix γα if
S' ⇒*_rightmost γAx ⇒ γαβx (γ ∈ (N ∪ T)*, x ∈ T*),
and a is the first symbol of x, or a = # if x = ε.
Example 2.15 Let a grammar and the derivation rules as follows.
Using these rules, we can derive . Here is a viable prefix, and is valid for this viable prefix. Similarly, , and item is valid for viable prefix .
When creating an LR(1) parser, we construct the canonical sets of LR(1) items. To achieve this we have to define the functions closure and read.
Definition 2.20 Let the set I be a set of LR(1) items for a given grammar. The set closure(I) consists of the next items:
1. every element of the set I is an element of the set closure(I),
2. if [A → α.Bβ, a] ∈ closure(I) and B → γ is a derivation rule of the grammar, then [B → .γ, b] ∈ closure(I) for all b ∈ First(βa),
3. the set closure(I) is expanded using step 2 until no more items can be added to it.
By the definitions, if the item [A → α.Bβ, a] is valid for the viable prefix γα, then the item [B → .δ, b] is valid for the same viable prefix for all b ∈ First(βa) (Figure 2.14). It is obvious that the function closure creates all items which are valid for a given viable prefix.
We can compute the function closure(I), i.e. the closure of the set I, by the following algorithm. The result of this algorithm is the set K.
Closure-Set-of-Items(I)
 1  K ← ∅
 2  FOR all items E ∈ I
 3    DO K ← K ∪ CLOSURE-ITEM(E)
 4  RETURN K
Closure-Item(E)
 1  K_E ← {E}
 2  IF the item E has the form [A → α.Bβ, a]
 3    THEN I ← {E}
 4         J ← ∅
 5         REPEAT
 6           FOR all items of I which have the form [C → γ.Dδ, b]
 7             DO FOR all rules D → η ∈ P
 8                  DO FOR all symbols c ∈ First(δb)
 9                       DO J ← J ∪ {[D → .η, c]}
10           J ← J \ K_E
11           IF J ≠ ∅
12             THEN K_E ← K_E ∪ J
13           I ← J, J ← ∅
14         UNTIL I = ∅
15  RETURN K_E
The algorithm CLOSURE-ITEM creates K_E, the closure of the item E. If, in the argument E, the “point” is followed by a terminal symbol, then the result is this item only (line 1). If in E the “point” is followed by a nonterminal symbol B, then we can create new items from every rule having the symbol B on its left side (line 9). We have to check this condition for all new items too; the REPEAT cycle is in lines 5–14. These steps are executed until no more items can be added (line 14). The set I contains the items to be checked, the set J contains the new items. The operation J \ K_E can be found in line 10.
Definition 2.21 Let the set I be a set of LR(1) items for the grammar G. Then the set read(I, X) (X ∈ N ∪ T) consists of the following items:
1. if [A → α.Xβ, a] ∈ I, then all items of the set closure({[A → αX.β, a]}) are in read(I, X),
2. the set read(I, X) is extended using step 1 until no more items can be added to it.
The function read “reads the symbol X” in the items of I; after this operation the sign “point” in these items gets to the other side of X. If the set I contains the valid items for the viable prefix γ, then the set read(I, X) contains the valid items for the viable prefix γX.
The algorithm READ-SET executes the function read. The result is the set K.
Read-Set(I, X)
 1  K ← ∅
 2  FOR all items E ∈ I
 3    DO K ← K ∪ READ-ITEM(E, X)
 4  RETURN K
Read-Item(E, X)
 1  IF the item E has the form [A → α.Yβ, a] and Y = X
 2    THEN K_{E,X} ← CLOSURE-ITEM([A → αX.β, a])
 3    ELSE K_{E,X} ← ∅
 4  RETURN K_{E,X}
Using these algorithms we can create all items which describe the state of the parser after reading the symbol X.
Now we introduce the following notation for items, to give shorter descriptions: let [A → α.β, a/b] be a notation for the two items [A → α.β, a] and [A → α.β, b].
Example 2.16 The item is an item of the grammar in Example 2.15. For this item
We can create the canonical sets of LR(1) items, or shortly the canonical sets, with the following method.
Definition 2.22 The canonical sets of items H0, H1, ..., Hm are defined as follows.
H0 = closure({[S' → .S, #]}),
create the set read(H0, X) for a symbol X. If this set is not empty and it is not equal to the canonical set H0, then it is the next canonical set H1.
Repeat this operation for all possible terminal and nonterminal symbols X. If we get a nonempty set which is not equal to any of the previous sets, then this set is a new canonical set, and its index is greater by one than the maximal index of the previously generated canonical sets.
Repeat the above operation for all previously generated canonical sets and for all symbols of the grammar until no more sets can be added.
The sets
H0, H1, ..., Hm
are the canonical sets of items of the grammar G.
The number of sets of items of a grammar is finite, hence the above method terminates in finite time.
The next algorithm creates the canonical sets of the grammar G.

Create-Canonical-Sets(G)
 1  i ← 0
 2  H0 ← Closure-Set-of-Items({[S' → .S, #]})
 3  I ← {H0}
 4  REPEAT
 5    L ← I, I ← ∅
 6    FOR all H ∈ L
 7      DO FOR all symbols X ∈ T ∪ N
 8        DO J ← Closure-Set-of-Items(Read-Set-of-Items(H, X))
 9          IF J ≠ ∅ and J ≠ Hk for all k, 0 ≤ k ≤ i
10            THEN i ← i + 1
11                 Hi ← J
12                 I ← I ∪ {J}
13  UNTIL I = ∅
14  RETURN H0, H1, ..., Hi
The result of the algorithm is the collection H0, H1, ..., Hi of canonical sets. The first canonical set H0 is created in line 2. Further canonical sets are created by the functions Closure-Set-of-Items(Read-Set-of-Items(H, X)) in line 8. The condition in line 9 checks that the new set differs from all previous sets, and if the answer is true, then this set becomes a new canonical set in lines 10–12. The FOR cycle in lines 6–12 guarantees that these operations are executed for all sets previously generated. The REPEAT cycle in lines 4–13 generates new canonical sets as long as it is possible.
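The functions closure and read and the canonical-set iteration above can be sketched in Python. This is a minimal illustration, not the book's pseudocode: it uses a small hypothetical toy grammar S' → S, S → aS | b with end marker #, instead of Example 2.15, and all identifiers are our own.

```python
from collections import namedtuple

# An LR(1) item [A -> alpha . beta, a] is modelled as a tuple
# (head, body, dot, la). The toy grammar S' -> S, S -> aS | b
# and every name below are illustrative, not the book's example.
Item = namedtuple("Item", "head body dot la")

GRAMMAR = {"S'": [("S",)], "S": [("a", "S"), ("b",)]}
TERMINALS = {"a", "b", "#"}

def first_of(seq):
    # First of a nonempty symbol string; the toy grammar has no
    # epsilon rules, so the leading symbol decides the answer.
    sym = seq[0]
    if sym in TERMINALS:
        return {sym}
    return {t for body in GRAMMAR[sym] for t in first_of(body)}

def closure(items):
    # From [A -> alpha . B beta, a] add [B -> . gamma, b] for every
    # rule B -> gamma and every b in First(beta a), until nothing new.
    result = set(items)
    changed = True
    while changed:
        changed = False
        for it in list(result):
            if it.dot < len(it.body) and it.body[it.dot] in GRAMMAR:
                B = it.body[it.dot]
                for b in first_of(it.body[it.dot + 1:] + (it.la,)):
                    for body in GRAMMAR[B]:
                        new = Item(B, body, 0, b)
                        if new not in result:
                            result.add(new)
                            changed = True
    return frozenset(result)

def read(items, X):
    # Move the dot over X in every fitting item, then close the result.
    moved = {it._replace(dot=it.dot + 1) for it in items
             if it.dot < len(it.body) and it.body[it.dot] == X}
    return closure(moved)

def canonical_sets():
    # Start from H0 = closure({[S' -> .S, #]}) and apply read for all
    # grammar symbols until no new canonical set appears.
    H0 = closure({Item("S'", ("S",), 0, "#")})
    sets, pending = [H0], [H0]
    while pending:
        H = pending.pop()
        for X in sorted(TERMINALS | set(GRAMMAR)):
            J = read(H, X)
            if J and J not in sets:
                sets.append(J)
                pending.append(J)
    return sets
```

For this toy grammar the iteration yields five canonical sets, which can be checked against a by-hand construction.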
Example 2.17 The canonical sets of items for the grammar of Example 2.15 are as follows.
The automaton of the parser is in Figure 2.15.
If the canonical sets of items
H0, H1, ..., Hm
were created, then assign the state k of an automaton to the set Hk. The relation between the states of the automaton and the canonical sets of items is stated by the next theorem. This theorem is the "great" theorem of the LR(1) parsing.
Theorem 2.23 The set of the items being valid for a viable prefix γ can be assigned to the automaton state k such that there is a path from the initial state to state k labeled by γ.
This theorem states that we can create the automaton of the parser using canonical sets. Now we give a method to create this parser from canonical sets of items.
The deterministic finite automaton can be described with a table, which is called the parsing table. The rows of the table are assigned to the states of the automaton.
The parsing table has two parts. The first is the action table. Since the operations of the parser are determined by the symbols of the analysed text, the action table is divided into columns labeled by the terminal symbols. The action table contains the information about the action to be performed at the given state and at the given symbol. These actions can be shifts or reductions. The sign of a shift operation is sj, where j is the next state. The sign of the reduction is rl, where l is the serial number of the applied rule. The reduction by the rule having the serial number zero means the termination of the parsing and that the parsed text is syntactically correct; for this reason we call this operation accept.
The second part of the parsing table is the goto table. In this table there is information about the shifts caused by nonterminals. (Shifts belonging to terminals are in the action table.)
Let {0, 1, ..., m} be the set of states of the automaton. The i-th row of the table is filled in from the items of the canonical set Hi.
The i-th row of the action table:
if [A → α.aβ, b] ∈ Hi and read(Hi, a) = Hj, then action[i, a] = sj,
if [A → α., a] ∈ Hi and A ≠ S', then action[i, a] = rl, where A → α is the l-th rule of the grammar,
if [S' → S., #] ∈ Hi, then action[i, #] = acc.
The method of filling in the goto table:
if read(Hi, A) = Hj, then goto[i, A] = j.
In both tables we have to write the text error into the empty positions.
These action and goto tables are called canonical parsing tables.
Theorem 2.24 The augmented grammar is an LR(1) grammar if and only if we can fill in the parsing tables created for this grammar without conflicts.
We can fill in the parsing tables with the next algorithm.
Fill-in-LR(1)-Table(G)
 1  FOR all LR(1) canonical sets Hi
 2    DO FOR all LR(1) items of Hi
 3      IF [A → α.aβ, b] ∈ Hi and read(Hi, a) = Hj
 4        THEN action[i, a] ← sj
 5      IF [A → α., a] ∈ Hi, A ≠ S', and A → α is the l-th rule
 6        THEN action[i, a] ← rl
 7      IF [S' → S., #] ∈ Hi
 8        THEN action[i, #] ← acc
 9      IF read(Hi, X) = Hj and X ∈ N
10        THEN goto[i, X] ← j
11      FOR all a ∈ T ∪ {#}
12        DO IF action[i, a] = "empty"
13          THEN action[i, a] ← error
14      FOR all X ∈ N
15        DO IF goto[i, X] = "empty"
16          THEN goto[i, X] ← error
17  RETURN action, goto
We fill in the tables line by line. In lines 3–8 of the algorithm we fill in the action table, in lines 9–10 we fill in the goto table. In lines 11–16 we write error into the positions which remained empty.
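The three action rules and the goto rule can also be illustrated with a short Python sketch for a single state. Everything here is hypothetical scaffolding rather than the book's algorithm: items are (head, body, dot, lookahead) tuples, read_to[(i, X)] = j encodes read(Hi, X) = Hj, and rule_no numbers the rules.

```python
TERMINALS = {"a", "b", "#"}  # terminals of an assumed toy grammar

def fill_row(i, items, read_to, rule_no, action, goto):
    """Fill row i of the action and goto tables from canonical set H_i."""
    for (A, body, dot, la) in items:
        if dot < len(body):
            X = body[dot]
            if X in TERMINALS:                        # [A -> alpha.a beta, b]
                action[i, X] = ("s", read_to[i, X])   # shift to state j
        elif A == "S'":
            action[i, la] = ("acc",)                  # [S' -> S., #]: accept
        else:
            action[i, la] = ("r", rule_no[A, body])   # [A -> alpha., a]: reduce
    for (i2, X), j in read_to.items():                # nonterminal transitions
        if i2 == i and X not in TERMINALS:
            goto[i, X] = j
    # Positions not filled in here are treated as error by the parser.
```

A conflict would show up as two different values competing for the same action[i, a] position, which is exactly the condition of Theorem 2.24.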
Now we deal with the steps of the parsing (Figure 2.16).
The state of the parsing is described by configurations. A configuration of the parser consists of two parts: the first is the stack and the second is the unexpended input text.
The stack of the parsing is a double stack: we write or read two data with the operations push or pop. The stack consists of pairs of symbols, where the first element of a pair is a terminal or nonterminal symbol, and the second element is the serial number of a state of the automaton. The content of the stack in the start state is (#, 0).
The start configuration is (#0, x#), where x# means the unexpended text.
The parsing is successful if the parser moves to the final state. In the final state the stack contains (#, 0) and the pair formed from the start symbol and its state, and the parser is at the end of the text.
Suppose that the parser is in the configuration (#0 X1 h1 ... Xk hk, ay#). The next move of the parser is determined by action[hk, a].
State transitions are the following.
If action[hk, a] = sl, i.e. the parser executes a shift, then the actual symbol a and the new state l are written into the stack. That is, the new configuration is
(#0 X1 h1 ... Xk hk a l, y#).
If action[hk, a] = rl, then we execute a reduction by the l-th rule A → α. In this step we delete |α| rows, i.e. we delete 2|α| elements from the stack, and then we determine the new state using the goto table. If after the deletion the state h is at the top of the stack, then the new state is goto[h, A] = h', and the new configuration is
(#0 X1 h1 ... X(k-m) h(k-m) A h', ay#), where m = |α|.
If action[hk, a] = acc, then the parsing is completed, and the analysed text was correct.
If action[hk, a] = error, then the parsing terminates, and a syntactic error was discovered at the symbol a.
The LR(1) parser is often named canonical LR(1) parser.
Denote the action and goto tables together by T. We can give the following algorithm for the steps of the parser.
LR(1)-Parser(xay#, T)
 1  s ← parsing, push((#, 0))
 2  REPEAT
 3    let (#0 X1 h1 ... Xk hk, ay#) be the actual configuration
 4    IF T[hk, a] = sl
 5      THEN push((a, l))
 6    ELSE IF T[hk, a] = rl, and A → α is the l-th rule, and |α| = m
 7      THEN pop m pairs from the stack, and let h be the state at the top
 8           push((A, goto[h, A]))
 9    ELSE IF T[hk, a] = acc
10      THEN s ← O.K.
11    ELSE s ← ERROR
12  UNTIL s = O.K. OR s = ERROR
13  RETURN s and the actual configuration
The input parameters of the algorithm are the text x and the table T. The variable s indicates the action of the parser: it has the value parsing in the intermediate states, and its value is O.K. or ERROR in the final states. In line 3 we detail the configuration of the parser, which is necessary in lines 6–8. Using the action table, the parser determines its move from the state hk at the top of the stack and from the actual symbol a. In lines 4–5 we execute a shift step, in lines 6–8 a reduction. The algorithm is completed in lines 9–11. At this moment, if the parser is at the end of the text and the accept action was found, then the text is correct, otherwise a syntax error was detected. According to this, the output of the algorithm is O.K. or ERROR, and the final configuration is at the output, too. In the case of error, the first symbol of the second element of the configuration is the erroneous symbol.
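The steps of the parser can be sketched with hand-written tables for a hypothetical toy grammar S' → S, with rule 1: S → aS and rule 2: S → b. The states and tables below were computed by hand for this toy grammar, not taken from Example 2.18, and all names are illustrative.

```python
# Hand-built LR(1) tables for the toy grammar S' -> S,
# rule 1: S -> aS, rule 2: S -> b; states 0..4 computed by hand.
ACTION = {
    (0, "a"): ("s", 2), (0, "b"): ("s", 3),
    (1, "#"): ("acc",),
    (2, "a"): ("s", 2), (2, "b"): ("s", 3),
    (3, "#"): ("r", 2),
    (4, "#"): ("r", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}
RULES = {1: ("S", 2), 2: ("S", 1)}  # rule number -> (A, |alpha|)

def lr1_parse(text):
    """Run the configuration steps; text must end with the mark '#'."""
    stack = [("#", 0)]                    # pairs (symbol, state)
    pos = 0
    while True:
        state = stack[-1][1]
        move = ACTION.get((state, text[pos]), ("error",))
        if move[0] == "s":                # shift symbol and new state
            stack.append((text[pos], move[1]))
            pos += 1
        elif move[0] == "r":              # reduce by rule A -> alpha
            A, m = RULES[move[1]]
            del stack[len(stack) - m:]    # pop |alpha| pairs
            stack.append((A, GOTO[stack[-1][1], A]))
        elif move[0] == "acc":
            return "O.K."
        else:
            return "ERROR"
```

On "aab#" the sketch shifts a, a, b, reduces b to S, reduces aS to S twice, and accepts; on "a#" it stops with an error, since no action is defined for state 2 and the symbol #.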
Example 2.18 The action and goto tables of the parser for the grammar of Example 2.15 are as follows. The empty positions denote errors.
Example 2.19 Using the tables of the previous example, analyse the text .
The syntax tree of the sentence is in Figure 2.11.
Our goal is to decrease the number of states of the parser, since not only the size but also the speed of the compiler depends on the number of states. At the same time, we do not wish to restrict radically the set of grammars and languages by using our new method.
There are a lot of items in the canonical sets that are very similar: their cores are the same, only their lookahead symbols are different. If there are two or more canonical sets which differ in the lookahead symbols of their items only, then we merge these sets.
If the canonical sets Hi and Hj are mergeable, then let their merged set be Hi ∪ Hj.
Execute all possible mergings of canonical sets. After renumbering the indexes we obtain the merged canonical sets, or LALR(1) canonical sets.
We create the LALR(1) parser from these merged canonical sets.
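The merging step itself is simple to state in code. In this illustrative sketch, items are (head, body, dot, lookahead) tuples and a canonical set is a frozenset of them; the core drops the lookaheads, and sets with equal cores are united. All names are our own.

```python
def core(H):
    # The core of a set of LR(1) items: the items without lookaheads.
    return frozenset((A, body, dot) for (A, body, dot, la) in H)

def merge_canonical_sets(sets):
    # Unite the canonical sets whose cores coincide; the result is
    # the list of merged (LALR(1)) canonical sets.
    merged = {}
    for H in sets:
        c = core(H)
        merged[c] = merged.get(c, frozenset()) | H
    return list(merged.values())
```

Two sets that differ only in lookahead symbols have the same core and collapse into one merged set; sets with different cores stay separate.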
Example 2.20 Using the canonical sets of Example 2.17, we can merge the next canonical sets:
and ,
and ,
and .
In Figure 2.15 it can be seen that the mergeable sets are in equivalent or similar positions in the automaton.
There is no difficulty with the function read if we use merged canonical sets. If
K = H1 ∪ H2 ∪ ... ∪ Hk
then
read(K, X) = read(H1, X) ∪ read(H2, X) ∪ ... ∪ read(Hk, X).
We can prove this in the following way. By the definition of the function read, the set read(H, X) depends on the cores of the items in H only, and it is independent of the lookahead symbols. Since the cores of the items in the sets H1, H2, ..., Hk are the same, the cores of the items of the sets
read(H1, X), read(H2, X), ..., read(Hk, X)
are also the same. It follows that these sets are mergeable into one set, thus the above equation holds.
However, after merging canonical sets of items, the items of the merged set can raise difficulties. Suppose that K = H1 ∪ H2 ∪ ... ∪ Hk.
After merging there are no shift-shift conflicts. If
[A → α.aβ, b] ∈ H1
and
[C → γ.aδ, d] ∈ H2,
then there is a shift for the symbol a, and we saw that the function read does not cause a problem, i.e. the set read(K, a) is equal to the merged set of the sets read(Hi, a).
If there is an item
[A → α.aβ, b]
in the canonical set H1 and there is an item
[C → γ., a]
in the set H2, then the merged set K is an inadequate set with the symbol a, i.e. there is a shift-reduce conflict in the merged set.
But this case never happens. The sets H1 and H2 are mergeable sets, thus they differ in their lookahead symbols only. It follows that there is an item [C → γ., a] in the set H1, too. Using Theorem 2.24 we get that the grammar is not an LR(1) grammar; we get the shift-reduce conflict from the set H1 for the canonical LR(1) parser, too.
However, after merging, a reduce-reduce conflict may arise. The properties of the LR(1) grammar do not exclude this case. In the next example we show such a case.
Example 2.21 Let a grammar be given with the following derivation rules.
This grammar is an LR(1) grammar. For the viable prefix the items
for the viable prefix the items
create two canonical sets.
After merging these two sets we get a reduce-reduce conflict. If the input symbol is or , then the handle is , but we cannot decide whether we have to use the rule or the rule for the reduction.
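The situation of Example 2.21 can be detected mechanically: in a merged set, two complete items with different rules but a common lookahead signal a reduce-reduce conflict. A small sketch, with items again as hypothetical (head, body, dot, lookahead) tuples:

```python
def reduce_reduce_conflicts(H):
    # Collect the complete items [A -> alpha., a] of the set H and
    # report every lookahead shared by two different rules.
    complete = [(A, body, la) for (A, body, dot, la) in H
                if dot == len(body)]
    conflicts = set()
    for i, (A, bodyA, la1) in enumerate(complete):
        for (B, bodyB, la2) in complete[i + 1:]:
            if la1 == la2 and (A, bodyA) != (B, bodyB):
                conflicts.add(la1)
    return conflicts
```

An empty result means the merged set causes no reduce-reduce conflict; a nonempty result lists the offending lookahead symbols.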
Now we give the method for creating the LALR(1) parsing tables. First we create the canonical sets of LR(1) items
H0, H1, ..., Hm,
then we merge the canonical sets in which the sets constructed from the cores of the items are identical. Let
K0, K1, ..., Kn (n ≤ m)
be the merged canonical sets.
For the calculation of the size of the action and goto tables and for filling in these tables we use the sets K0, K1, ..., Kn. The method is the same as it was in the case of LR(1) parsers. The constructed tables are named LALR(1) parsing tables.
Definition 2.25 If the filling in of the LALR(1) parsing tables does not produce conflicts, then the grammar is said to be an LALR(1) grammar.
The run of the LALR(1) parser is the same as that of the LR(1) parser.
Example 2.22 Denote the result of merging the canonical sets and by . Let be the state which belongs to this set.
The canonical sets of the grammar of Example 2.15 were given in Example 2.17, and the mergeable sets were seen in Example 2.20. For this grammar we can create the following LALR(1) parsing tables.
The filling in of the tables is conflict-free, therefore the grammar is an LALR(1) grammar. The automaton of this parser is in Figure 2.18.
Example 2.23 Analyse the text using the parsing table of the previous example.
The syntax tree of the parsed text is in Figure 2.17.
As it can be seen from the previous example, the LALR(1) grammars are LR(1) grammars. The converse assertion is not true. In Example 2.21 there is a grammar which is LR(1), but it is not an LALR(1) grammar.
Programming languages can be written by LALR(1) grammars. The most frequently used method in the compilers of programming languages is the LALR(1) method. The advantage of the LALR(1) parser is that the sizes of its parsing tables are smaller than the sizes of the canonical LR(1) parsing tables.
For example, the LALR(1) parsing tables for the Pascal language have a few hundreds of lines, whilst the canonical LR(1) parsers for this language have a few thousands of lines.
Exercises
2.3-1 Find the grammars among the following grammars (we give their derivation rules only).
1.
2.
3.
4.
2.3-2 Prove that the next grammars are grammars (we give their derivation rules only).
1.
2.
3.
2.3-3 Prove that the next grammars are not grammars (we give their derivation rules only).
1.
2.
3.
2.3-4 Show that a language has only one sentence.
2.3-5 Prove that the next grammars are grammars (we give their derivation rules only).
1.
2.
2.3-6 Prove that the next grammars are grammars (we give their derivation rules only).
1.
2.
2.3-7 Prove that the next grammars are not grammars for any (we give their derivation rules only).
1.
2.
2.3-8 Prove that the next grammars are but are not grammars (we give their derivation rules only).
1.
2.
2.3-9 Create parsing tables for the above grammars.
2.3-10 Using the recursive descent method, write the parsing program for the above grammars.
2.3-11 Create canonical sets and the parsing tables for the above grammars.
2.3-12 Create merged canonical sets and the parsing tables for the above grammars.
PROBLEMS

2-1
Lexical analysis of a program text
The algorithm Lex-Analyse in Section 2.2 gives a scanner for a text that is described by only one regular expression or deterministic finite automaton, i.e. this scanner is able to analyse only one symbol. Create an automaton which executes the total lexical analysis of a programming language, and give the algorithm Lex-Analyse-Language for this automaton. Let the input of the algorithm be the text of a program, and the output be the series of symbols. It is obvious that if the automaton goes into a final state, then its next run begins at the initial state, for analysing the next symbol. The algorithm finishes its work if it is at the end of the text or a lexical error is detected.
2-2
Series of symbols augmented with data of symbols
Modify the algorithm of the previous problem in such a way that the output is the series of symbols augmented with the appropriate attributes. For example, the attribute of a variable is the character string of its name, and the attribute of a number is its value and type. It is practical to write pointers to the data in their places.
2-3
LALR(1) parser from LR(0) canonical sets
If we omit the lookahead symbols from the LR(1) items, then we get LR(0) items. We can define the functions closure and read for LR(0) items too, not caring about the lookahead symbols. Using a method similar to the method of Create-Canonical-Sets, we can construct the LR(0) canonical sets.
One can observe that the number of merged LR(1) canonical sets is equal to the number of LR(0) canonical sets, since the cores of the items of the merged canonical sets are the same as the items of the LR(0) canonical sets. Therefore the number of states of the LALR(1) parser is equal to the number of states of its LR(0) parser.
Using this property, we can construct LALR(1) canonical sets from LR(0) canonical sets, by completing the items of the LR(0) canonical sets with lookahead symbols. The result of this procedure is the set of LALR(1) canonical sets.
It is obvious that the right side of an LR(0) item begins with the symbol point only if this item was constructed by the function closure. (We notice that there is one exception, the item [S' → .S] of the canonical set H0.) Therefore there is no need to store all items of the canonical sets. Let the kernel of the canonical set H0 be the item [S' → .S], and let the kernel of any other canonical set be the set of its items in which the point is not at the first position of the right side. We can give an LR(0) canonical set by its kernel, since all of its items can be constructed from the kernel using the function closure.
If we complete the items of the kernels of the LR(0) canonical sets with lookahead symbols, then we get the kernels of the merged LR(1) canonical sets. That is, if the kernel of an LR(0) canonical set contains the item [B → γ.δ], then from it with completions we get the items [B → γ.δ, a] of the kernel of the LALR(1) canonical set.
If we know an LR(0) canonical set K, then we can construct read(K, X) easily: if [B → γ.Xδ] ∈ K, then [B → γX.δ] ∈ read(K, X). For LALR(1) items, if [B → γ.Xδ, b] is given, then we also have to determine the lookahead symbols of the item [B → γX.δ], i.e. the symbols a such that [B → γX.δ, a] is an item of the LALR(1) canonical set.
If [A → α.Xβ, b] ∈ closure({[B → γ.δ, a]}) and b ≠ a, then it is sure that [A → αX.β, b] ∈ read(K, X). In this case, we say that the lookahead symbol b was spontaneously generated for this item of the canonical set read(K, X). The symbol a does not play an important role in the construction of this lookahead symbol.
If [A → α.Xβ, a] ∈ closure({[B → γ.δ, a]}), then [A → αX.β, a] is an element of the set read(K, X), and its lookahead symbol is a. In this case we say that the lookahead symbol a is propagated from [B → γ.δ, a] into this item of the set read(K, X).
If the kernel of an LR(0) canonical set K is given, then we construct the propagated and spontaneously generated lookahead symbols for the items of read(K, X) by the following method.
For all items [B → γ.δ] of the kernel we construct the set closure({[B → γ.δ, @]}), where @ is a dummy symbol;
if [A → α.Xβ, b] ∈ closure({[B → γ.δ, @]}) and b ≠ @, then [A → αX.β, b] ∈ read(K, X), and the symbol b is spontaneously generated into this item of the set read(K, X);
if [A → α.Xβ, @] ∈ closure({[B → γ.δ, @]}), then [A → αX.β] ∈ read(K, X), and the lookahead symbols of the kernel item are propagated from [B → γ.δ] into this item of the set read(K, X).
The kernel of the canonical set H0 has only one element. The core of this element is [S' → .S]. For this item we can give the lookahead symbol # directly. Since the cores of the kernels of all canonical sets are given, using the above method we can calculate all of the propagated and spontaneously generated symbols.
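The dummy-symbol technique can be sketched in Python. The code is illustrative only: it uses "?" as the dummy lookahead and a hypothetical toy grammar S' → S, S → aSb | c, for which the kernel item [S → a.Sb] generates the lookahead b spontaneously and propagates its own lookaheads; every identifier is our own.

```python
from collections import namedtuple

# Items are (head, body, dot, la); "?" is the dummy lookahead symbol.
Item = namedtuple("Item", "head body dot la")
GRAMMAR = {"S'": [("S",)], "S": [("a", "S", "b"), ("c",)]}
TERMINALS = {"a", "b", "c", "#", "?"}

def first_of(seq):
    sym = seq[0]
    if sym in TERMINALS:
        return {sym}
    return {t for body in GRAMMAR[sym] for t in first_of(body)}

def closure(items):
    # The usual LR(1) closure of a set of items.
    result = set(items)
    changed = True
    while changed:
        changed = False
        for it in list(result):
            if it.dot < len(it.body) and it.body[it.dot] in GRAMMAR:
                B = it.body[it.dot]
                for b in first_of(it.body[it.dot + 1:] + (it.la,)):
                    for body in GRAMMAR[B]:
                        new = Item(B, body, 0, b)
                        if new not in result:
                            result.add(new)
                            changed = True
    return result

def classify(kernel_core):
    # kernel_core is (head, body, dot). Close it with the dummy "?".
    # For each item with the dot before a symbol, the one-step read
    # target receives: spontaneous lookaheads (la != "?"), and a
    # propagation flag when la == "?" (the kernel's own lookaheads
    # are propagated into the target item).
    result = {}
    A, body, dot = kernel_core
    for it in closure({Item(A, body, dot, "?")}):
        if it.dot < len(it.body):
            target = (it.head, it.body, it.dot + 1)
            spont, prop = result.get(target, (set(), False))
            if it.la == "?":
                prop = True
            else:
                spont.add(it.la)
            result[target] = (spont, prop)
    return result
```

For the kernel item [S → a.Sb], the item [S → aS.b] only receives propagated lookaheads, while [S → c.] receives the spontaneously generated lookahead b.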
Give the algorithm which constructs the LALR(1) canonical sets from the LR(0) canonical sets using the methods of propagation and spontaneous generation.
CHAPTER NOTES

The theory and practice of compilers, computers and programming languages are of the same age. The construction of the first compilers dates back to the 1950's. The task of writing compilers was a very hard task at that time; the first Fortran compiler took 18 man-years to implement [ 6 ]. Since that time more and more precise definitions and solutions have been given to the problems of compilation, and better and better methods and utilities have been used in the construction of translators.
The development of formal languages and automata was a great leap forward, and we can say that this development was urged by the demand for writing compilers. In our days this task is a simple routine project. New results, new discoveries are expected in the field of code optimisation only.
One of the earliest nondeterministic and backtrack algorithms appeared in the 1960's. The first two dynamic programming algorithms were the CYK (Cocke–Younger–Kasami) algorithm from 1965–67 and the Earley algorithm from 1965. The idea of precedence parsers is from the end of the 1970's and from the beginning of the 1980's. The LR(k) grammars were defined by Knuth in 1965; the definition of LL(k) grammars is dated from the beginning of the 1970's. LALR grammars were studied by De Remer in 1971; the elaboration of LALR(1) parsing methods was finished in the beginning of the 1980's [ 7 ] [ 5 ] [ 6 ].
By the middle of the 1980's it became obvious that the LALR(1) parsing methods are the really efficient methods, and since then the LALR(1) methods are used in compilers [ 7 ].
A lot of excellent books deal with the theory and practice of compilers. Perhaps the most successful of them was the book of Gries [ 100 ]; in this book there are interesting results for precedence grammars. The first successful book which wrote about the new algorithms was that of Aho and Ullman [ 5 ]; we can also find here the CYK and the Earley algorithms. It was followed by the "dragon book" of Aho and Ullman [ 6 ]; the extended and corrected issue of it was published in 1986 by the authors Aho, Sethi and Ullman [ 7 ].
Without aiming at completeness we mention the books of Fischer and LeBlanc [ 74 ], Tremblay and Sorenson [ 258 ], Waite and Goos [ 267 ], Hunter [ 121 ], Pittman [ 206 ] and Mak [ 171 ]. Advanced achievements are in recently published books, among others in the book of Muchnick [ 189 ], of Grune, Bal, Jacobs and Langendoen [ 102 ], in the book of Cooper and Torczon [ 52 ] and in a chapter of the book by Louden [ 169 ].