## 2.3 Syntactic analysis

The complete definition of a programming language includes the definition of its syntax and its semantics.

The syntax of programming languages cannot be described by context-free grammars. It is possible using context-dependent grammars, two-level grammars or attribute grammars. Since there are no efficient parsing methods for these grammars, the description of a language consists of two parts. The main part of the syntax is given using context-free grammars, and for the remaining part a context-dependent or an attribute grammar is applied. For example, the description of the program structure or of the statement structure belongs to the first part, while type checking, the scope of variables or the correspondence of formal and actual parameters belong to the second part.

The checking of properties described by context-free grammars is called syntactic analysis or parsing. Properties that cannot be described by context-free grammars form the static semantics. These properties are checked by the semantic analyser.

The conventional semantics is called run-time semantics or dynamic semantics. The dynamic semantics can be given by verbal methods or by interpreter methods, where the operation of the program is given by the series of state changes of the interpreter and its environment.

We deal with context-free grammars, and in this section we use extended grammars for the syntactic analysis. We investigate methods for checking the properties that are described by context-free grammars. First we give the basic notions of syntactic analysis, then the parsing algorithms are studied.

Definition 2.1 Let G = (N, T, P, S) be a grammar. If S ⇒* α and α ∈ (N ∪ T)* then α is a sentential form. If S ⇒* x and x ∈ T* then x is a sentence of the language defined by the grammar.

The sentence has an important role in parsing. The program written by a programmer is a series of terminal symbols, and this series is a sentence if it is correct, that is, if it has no syntactic errors.

Definition 2.2 Let G be a grammar and let α = α₁βα₂ be a sentential form (α, α₁, α₂, β ∈ (N ∪ T)*). We say that β is a phrase of α if there is a symbol A ∈ N for which S ⇒* α₁Aα₂ and A ⇒* β. We say that β is a simple phrase of α if A → β ∈ P.

We note that every sentence is a phrase. The leftmost simple phrase has an important role in parsing; it has its own name.

Definition 2.3 The leftmost simple phrase of a sentence is the handle.

The leaves of the syntax tree of a sentence are terminal symbols, the other vertices of the tree are nonterminal symbols, and the root of the tree is the start symbol of the grammar.

In an ambiguous grammar there is at least one sentence that has several syntax trees. This means that such a sentence has more than one analysis, and therefore there are several target programs for it. This ambiguity raises many problems, therefore compilers translate only languages generated by unambiguous grammars.

We suppose that the grammar has the following properties:

1. the grammar is cycle-free, that is, it has no derivation of the form A ⇒⁺ A (A ∈ N),

2. the grammar is reduced, that is, there are no "unused symbols" in the grammar: every nonterminal occurs in some derivation, and from every nonterminal we can derive a part of a sentence. This last property means that for all A ∈ N there are derivations S ⇒* αAβ and A ⇒* x, where α, β ∈ (N ∪ T)* and x ∈ T*.
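Both assumptions can be checked mechanically. The following Python sketch (an illustration added here, not part of the text; the grammar encoding is an assumption) decides whether a grammar is reduced, i.e. every nonterminal is both productive and reachable:

```python
# Hypothetical helper: a grammar is a dict mapping a nonterminal to a list
# of right-hand sides, each a list of symbols; symbols that are not keys
# of the dict are treated as terminals.

def is_reduced(grammar, start):
    nonterminals = set(grammar)

    # Productive nonterminals: those that derive some terminal string (fixpoint).
    productive = set()
    changed = True
    while changed:
        changed = False
        for a, rules in grammar.items():
            if a in productive:
                continue
            for rhs in rules:
                if all(x not in nonterminals or x in productive for x in rhs):
                    productive.add(a)
                    changed = True
                    break

    # Reachable nonterminals: those occurring in a sentential form from start.
    reachable = {start}
    stack = [start]
    while stack:
        for rhs in grammar.get(stack.pop(), []):
            for x in rhs:
                if x in nonterminals and x not in reachable:
                    reachable.add(x)
                    stack.append(x)

    return nonterminals == productive and nonterminals == reachable
```

A grammar with an unreachable or unproductive nonterminal fails the test.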

As we have seen, the lexical analyser translates the program written by a programmer into a series of terminal symbols, and this series is the input of the syntactic analyser. The task of the syntactic analyser is to decide whether this series is a sentence of the grammar or not. To achieve this goal, the parser creates the syntax tree of the series of symbols. From the known start symbol and the leaves of the syntax tree the parser creates all vertices and edges of the tree, that is, it creates a derivation of the program. If this is possible, then we say that the program is an element of the language, which means that the program is syntactically correct.

Henceforth we deal with left-to-right parsing methods. These methods read the symbols of the program from left to right. All real compilers use this approach.

There are several methods to create the inner part of the syntax tree. One of them builds the syntax tree starting from its root, the start symbol S; this is called the top-down method. If the parser goes from the leaves towards the symbol S, then it uses a bottom-up parsing method.

We deal with top-down parsing methods in Subsection 2.3.1. We investigate bottom-up parsers in Subsection 2.3.2; nowadays these methods are used in real compilers.

### 2.3.1 LL(1) parser

If we analyse top-down, then we start with the start symbol. This symbol is the root of the syntax tree, and we attempt to construct the tree downwards. Our goal is that the leaves of the tree form the terminal symbols of the analysed text.

First we review the notions that are necessary for top-down parsing. Then the table methods and the recursive-descent method are analysed.

#### 2.3.1.1. LL(k) grammars

Our methods build the syntax tree top-down and read the symbols of the program left to right. To this end we try to create terminals on the left side of sentential forms.

Definition 2.4 If A → α ∈ P, then the leftmost direct derivation of the sentential form xAβ (x ∈ T*, β ∈ (N ∪ T)*) is xαβ, and we write xAβ ⇒_leftmost xαβ.

Definition 2.5 If all direct derivations in S ⇒* x (x ∈ T*) are leftmost, then this derivation is said to be a leftmost derivation, and we write S ⇒*_leftmost x.

In a leftmost derivation, terminal symbols appear at the left side of the sentential forms. Therefore we use leftmost derivations in all top-down parsing methods. Hence, when we deal with top-down methods, we do not write the word "leftmost" at the arrows.

One might as well say that we create all possible syntax trees. Reading their leaves from left to right, we obtain the sentences of the language. Then we compare these sentences with the parseable text, and if a sentence is the same as that text, then we can read the steps of parsing from the syntax tree that belongs to this sentence. But this method is not practical; in general it is even impossible to apply.

A better idea is the following. We start at the start symbol of the grammar, and using leftmost derivations we try to create the text of the program. If we use an unsuitable derivation at one step of the parsing, then we find that, at a later step, we cannot apply a proper derivation: in this case there are terminal symbols at the left side of the sentential form that differ from the corresponding symbols of our parseable text.

For the leftmost terminal symbols we state the following theorem.

Theorem 2.6 If xα ⇒* yβ (x, y ∈ T*, and α, β either begin with a nonterminal symbol or are empty), then x is a prefix of y.

The proof of this theorem is trivial: it is not possible to change the leftmost terminal symbols of a sentential form using the derivation rules of a context-free grammar.

This theorem is used during the building of the syntax tree to check that the leftmost terminals of the tree are the same as the leftmost symbols of the parseable text. If they differ, then we went in a wrong direction with this syntax tree. In this case we have to make a backtrack and apply another derivation rule. If this is impossible (for example because there are no more derivation rules), then we have to backtrack once again.

General top-down methods are realised by using backtrack algorithms, but the backtrack steps make the parser very slow. Therefore we deal only with grammars that have parsing methods without backtracks.

The main property of LL(k) grammars is the following. If, by creating the leftmost derivation S ⇒* wx (w, x ∈ T*), we obtain the sentential form wAβ (A ∈ N, β ∈ (N ∪ T)*) at some step of this derivation, and our goal is to achieve wx, then the next step of the derivation for the nonterminal A is determinable unambiguously from the first k symbols of x.

To look ahead k symbols we define the function First_k.

Definition 2.7 Let First_k(α) (k ≥ 0, α ∈ (N ∪ T)*) be the set

First_k(α) = { x | α ⇒* xβ and |x| = k } ∪ { x | α ⇒* x and |x| < k }, where x ∈ T* and β ∈ (N ∪ T)*.

The set First_k(x) consists of the first k symbols of x; for |x| < k, it consists of the full x. If α ⇒* ε, then ε ∈ First_k(α).

Definition 2.8 The grammar G is an LL(k) grammar (k ≥ 1) if, for the derivations

S ⇒*_leftmost wAβ ⇒ wα₁β ⇒* wx,
S ⇒*_leftmost wAβ ⇒ wα₂β ⇒* wy

(w, x, y ∈ T*, A ∈ N, α₁, α₂, β ∈ (N ∪ T)*), the equality

First_k(x) = First_k(y)

implies

α₁ = α₂.

Using this definition, if a grammar is an LL(k) grammar then the k symbols standing after the already parsed text w determine the next derivation rule unambiguously (Figure 2.8).

One can see from this definition that if a grammar is an LL(k) grammar then for all k′ > k it is also an LL(k′) grammar. When we speak about an LL(k) grammar, we also mean that k is the least number such that the properties of the definition hold.

Example 2.5 The next grammar is an LL(1) grammar. Let G be a grammar whose derivation rules are:

We have to use one of the derivations for the start symbol if the next symbol of the parseable text is a terminal that can begin it; we use the other derivation if the next symbol is the mark #.

Example 2.6 The next grammar is an LL(2) grammar. Let G be a grammar whose derivation rules are:

One can see that, at the last step of the two derivations, if we look ahead only one symbol, then in both derivations we obtain the same symbol. The proper rule for the nonterminal can be determined only by looking ahead two symbols.

There are context-free grammars that are not LL(k) grammars. For example, the next grammar is not an LL(k) grammar for any k (k ≥ 1).

Example 2.7 Let G be a grammar whose derivation rules are:

The language consists of two kinds of sentences. If we analyse a sentence, then at the first step we cannot decide by looking ahead k symbols which of the two derivations we have to use, since the two kinds of sentences agree in their first k symbols for all k.

By the definition of the LL(k) grammar, if we obtain the sentential form wAβ using leftmost derivations, then the next k symbols determine the next rule for the symbol A. This is stated in the next theorem.

Theorem 2.9 The grammar G is an LL(k) grammar iff

S ⇒*_leftmost wAβ, and A → α₁, A → α₂ ∈ P (α₁ ≠ α₂)

implies

First_k(α₁β) ∩ First_k(α₂β) = ∅.

If there is a rule A → α in the grammar, then the set First_k(αβ) consists of the length-k prefixes of the terminal series generated from αβ. This implies that, for deciding the LL(k) property, we have to check not only the derivation rules, but also the infinite derivations.

We can give good, practically usable methods for LL(1) grammars only. Now we define the follower series, which follow a symbol or a series of symbols.

Definition 2.10 Follow_k(A) = { x | S ⇒* αAβ and x ∈ First_k(β) }, and if ε ∈ Follow_k(A), then Follow_k(A) := (Follow_k(A) \ {ε}) ∪ {#} (α, β ∈ (N ∪ T)*, A ∈ N).

The second part of the definition is necessary because, if there are no symbols after A in the derivation, that is β ⇒* ε, then the next symbol after A is the mark # only.

The set Follow(A) (A ∈ N) consists of the terminal symbols that can stand immediately after the symbol A in a sentential form of the derivation S ⇒* αAβ.

Theorem 2.11 The grammar G is an LL(1) grammar iff, for every nonterminal A and for all derivation rules A → α₁, A → α₂ ∈ P (α₁ ≠ α₂),

First₁(α₁Follow₁(A)) ∩ First₁(α₂Follow₁(A)) = ∅.

In this theorem the expression First₁(αFollow₁(A)) means that we have to concatenate to α the elements of the set Follow₁(A) separately, and for all elements of this new set we have to apply the function First₁.

It is evident that Theorem 2.11 is suitable for deciding whether a grammar is LL(1) or not.

Henceforth we deal with languages determined by LL(1) grammars, and we investigate the parsing methods of LL(1) languages. For the sake of simplicity, we omit the index 1 from the names of the functions First₁ and Follow₁.

The elements of the set First(α) are determined by the next algorithm.

First(α)

  1  IF α = ε
  2    THEN F ← {ε}
  3  IF α = a, where a ∈ T
  4    THEN F ← {a}
  5  IF α = A, where A ∈ N
  6    THEN IF A → ε ∈ P
  7      THEN F ← {ε}
  8      ELSE F ← ∅
  9    FOR all rules A → X_1 X_2 … X_m ∈ P (m ≥ 1)
 10      DO F ← F ∪ (First(X_1) \ {ε})
 11        FOR i ← 1 TO m − 1
 12          DO IF X_1 X_2 … X_i ⇒* ε
 13            THEN F ← F ∪ (First(X_{i+1}) \ {ε})
 14        IF X_1 X_2 … X_m ⇒* ε
 15          THEN F ← F ∪ {ε}
 16  IF α = X_1 X_2 … X_n (n ≥ 2)
 17    THEN F ← First(X_1) \ {ε}
 18      FOR i ← 1 TO n − 1
 19        DO IF X_1 X_2 … X_i ⇒* ε
 20          THEN F ← F ∪ (First(X_{i+1}) \ {ε})
 21      IF X_1 X_2 … X_n ⇒* ε
 22        THEN F ← F ∪ {ε}
 23  RETURN F



In lines 1–4 the set is given for ε and for a terminal symbol a. In lines 5–15 we construct the elements of this set for a nonterminal A. If ε can be derived from A then we put the symbol ε into the set in lines 6–7 and 14–15. If the argument is a symbol stream then the elements of the set are constructed in lines 16–22. We notice that we can terminate the FOR cycles in lines 11 and 18 as soon as their condition first fails, since in this case ε cannot be derived from any longer prefix either.
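As an illustration added here (not part of the original text), the same computation can be sketched in Python using an iterative fixpoint instead of recursion; the grammar encoding and helper names are assumptions of this sketch. The example grammar is the expression grammar used later in this section's recursive-descent program.

```python
# A grammar is a dict from nonterminals to lists of right-hand sides
# (tuples of symbols); EPS stands for the empty word ε.
EPS = ''

def first_sets(grammar):
    first = {a: set() for a in grammar}

    def first_of(alpha):
        """First set of a symbol string, using the current approximation."""
        result = set()
        for x in alpha:
            fx = first[x] if x in grammar else {x}   # terminals yield themselves
            result |= fx - {EPS}
            if EPS not in fx:
                return result
        result.add(EPS)          # every symbol of alpha can derive ε
        return result

    changed = True
    while changed:               # iterate until no First set grows
        changed = False
        for a, rules in grammar.items():
            for rhs in rules:
                new = first_of(rhs)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first

G = {
    'E':  [('T', "E'")],
    "E'": [('+', 'T', "E'"), ()],
    'T':  [('F', "T'")],
    "T'": [('*', 'F', "T'"), ()],
    'F':  [('(', 'E', ')'), ('i',)],
}
f = first_sets(G)
```

For this grammar, First(F) = First(T) = First(E) = {(, i}, while First(E′) = {+, ε} and First(T′) = {*, ε}.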

In Theorem 2.11 and hereafter, it is necessary to know the elements of the set Follow(A). The next algorithm constructs this set.

Follow(A)

  1  IF A = S
  2    THEN F ← {#}
  3    ELSE F ← ∅
  4  FOR all rules B → αAβ ∈ P
  5    DO IF |β| > 0
  6      THEN F ← F ∪ (First(β) \ {ε})
  7        IF ε ∈ First(β)
  8          THEN F ← F ∪ Follow(B)
  9      ELSE F ← F ∪ Follow(B)
 10  RETURN F



The elements of the set Follow(B) get into the set Follow(A). In lines 4–9 we check, if the argument A stands on the right side of a derivation rule, which symbols may stand immediately after it. It is obvious that no ε is in this set, and the symbol # is in the set only if the argument can be the rightmost symbol of a sentential form.
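The Follow computation can likewise be sketched in Python (an illustration added here, with assumed encodings); it repeats the compact First fixpoint internally so that the sketch is self-contained, and uses '#' for the end mark:

```python
EPS = ''

def follow_sets(grammar, start):
    # First sets, by the usual ε-aware fixpoint.
    first = {a: set() for a in grammar}
    def first_of(alpha):
        out = set()
        for x in alpha:
            fx = first[x] if x in grammar else {x}
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        return out | {EPS}
    changed = True
    while changed:
        changed = False
        for a, rules in grammar.items():
            for rhs in rules:
                n = first_of(rhs)
                if not n <= first[a]:
                    first[a] |= n; changed = True

    follow = {a: set() for a in grammar}
    follow[start].add('#')                  # the start symbol can be rightmost
    changed = True
    while changed:
        changed = False
        for b, rules in grammar.items():
            for rhs in rules:
                for i, x in enumerate(rhs):
                    if x not in grammar:
                        continue            # terminals have no Follow set
                    tail = first_of(rhs[i + 1:])
                    new = tail - {EPS}
                    if EPS in tail:         # β can vanish: inherit Follow(B)
                        new |= follow[b]
                    if not new <= follow[x]:
                        follow[x] |= new; changed = True
    return follow

G = {
    'E':  [('T', "E'")],
    "E'": [('+', 'T', "E'"), ()],
    'T':  [('F', "T'")],
    "T'": [('*', 'F', "T'"), ()],
    'F':  [('(', 'E', ')'), ('i',)],
}
fo = follow_sets(G, 'E')
```

For the expression grammar this gives Follow(E′) = {), #} and Follow(T′) = {+, ), #}, the sets needed later for the parsing table.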

#### 2.3.1.2. Parsing with table

Suppose that we analyse a series of terminal symbols xay# and the part x has already been analysed without errors. We analyse the text with a top-down method, so we use leftmost derivations. Suppose that our sentential form is xα, that is, it has the form xAβ or xaβ (A ∈ N, a ∈ T, β ∈ (N ∪ T)*) (Figure 2.9).

In the first case the next step is the substitution of the symbol A. We know the next element of the input series: it is the terminal a, therefore we can determine the correct substitution of the symbol A. This substitution is the rule A → α for which a ∈ First(αFollow(A)). If there is such a rule then, according to the definition of the LL(1) grammar, there is exactly one. If there is no such rule, then a syntactic error was found.

In the second case the next symbol of the sentential form is the terminal symbol a, thus we look out for the symbol a as the next symbol of the analysed text. If this comes true, that is, the actual symbol of the text is a, then the symbol a is a correct symbol and we can go further: we put the symbol a into the already analysed text. If the actual symbol differs from a, then a syntactic error is here. We can see that in this case the position of the error is known, and the erroneous symbol is the actual terminal symbol.

The action of the parser is the following. Let # be the sign of the right end of the analysed text, that is, the mark # is the last symbol of the text. We use a stack during the analysis; the bottom of the stack is signed by the mark #, too. We give serial numbers to the derivation rules, and during the analysis we write the number of each applied rule into a list. At the end of parsing we can write the syntax tree from this list (Figure 2.10).

We sign the state of the parser by triples (ay#, Xβ#, v). The first component is the text not analysed yet. The second component is the part of the sentential form corresponding to the not yet analysed text; this information is in the stack, with the symbol X at the top. The third component v is the list of the serial numbers of the applied production rules.

During the analysis we observe the symbol at the top of the stack and the first symbol of the not yet analysed text. The name of this latter symbol is the actual symbol. There are pointers to the top of the stack and to the actual symbol.

We use a top-down parser, therefore the initial content of the stack is S#. If the initial text to be analysed is xy#, then the initial state of the parsing process is the triple (xy#, S#, ε), where ε is the sign of the empty list.

We analyse the text, the series of symbols, using a parsing table. The rows of this table are signed by the symbols that can occur at the top of the stack, the columns by the possible next input symbols, and we write the mark # to the last row and the last column of the table. Hence the number of rows of the table is greater by one than the number of symbols of the grammar, and the number of columns is greater by one than the number of terminal symbols of the grammar.

The element T[X, a] of the table is as follows:

T[A, a] = (α, i), if A → α is the i-th derivation rule and either a ∈ First(α), or ε ∈ First(α) and a ∈ Follow(A),
T[a, a] = pop,
T[#, #] = accept,

and every other element of the table is error.

We fill in the parsing table using the following algorithm.

LL(1)-Table-Fill-in(G)

  1  FOR all A ∈ N
  2    DO IF A → α ∈ P is the i-th rule
  3      THEN FOR all a ∈ First(α)
  4        DO T[A, a] ← (α, i)
  5      IF ε ∈ First(α)
  6        THEN FOR all a ∈ Follow(A)
  7          DO T[A, a] ← (α, i)
  8  FOR all a ∈ T
  9    DO T[a, a] ← pop
 10  T[#, #] ← accept
 11  FOR all X ∈ (N ∪ T ∪ {#}) and all a ∈ (T ∪ {#})
 12    DO IF T[X, a] = "empty"
 13      THEN T[X, a] ← error
 14  RETURN T



In line 10 we write the text accept into the right lower corner of the table. In lines 8–9 we write the text pop into the main diagonal of the square labelled by terminal symbols. The program in lines 1–7 writes a tuple whose first element is the right part of a derivation rule and whose second element is the serial number of this rule. In lines 12–13 we write the text error into the empty positions.
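The table-filling step can be sketched in Python (an illustration added here, with assumed encodings) for the expression grammar of this section; the First and Follow sets are supplied precomputed, and a cell written twice would mean the grammar is not LL(1):

```python
EPS = ''

def fill_ll1_table(rules, first_of_rhs, follow, terminals):
    """rules: list of (A, alpha) pairs, numbered from 1.
       first_of_rhs[i]: First set of the right side of rule i."""
    table = {}
    for i, (a, rhs) in enumerate(rules, start=1):
        lookaheads = first_of_rhs[i] - {EPS}
        if EPS in first_of_rhs[i]:            # ε-rule: use Follow(A)
            lookaheads |= follow[a]
        for t in lookaheads:
            assert (a, t) not in table, 'conflict: grammar is not LL(1)'
            table[(a, t)] = (rhs, i)
    for t in terminals:                       # pop on the main diagonal
        table[(t, t)] = 'pop'
    table[('#', '#')] = 'accept'              # right lower corner
    return table

rules = [('E', "TE'"), ("E'", "+TE'"), ("E'", EPS), ('T', "FT'"),
         ("T'", "*FT'"), ("T'", EPS), ('F', '(E)'), ('F', 'i')]
first_of_rhs = {1: {'(', 'i'}, 2: {'+'}, 3: {EPS}, 4: {'(', 'i'},
                5: {'*'}, 6: {EPS}, 7: {'('}, 8: {'i'}}
follow = {'E': {')', '#'}, "E'": {')', '#'}, 'T': {'+', ')', '#'},
          "T'": {'+', ')', '#'}, 'F': {'*', '+', ')', '#'}}
table = fill_ll1_table(rules, first_of_rhs, follow, {'+', '*', '(', ')', 'i'})
```

Positions not present in the dictionary play the role of the error entries.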

The actions of the parser are written by state-transitions. The initial state is (xy#, S#, ε), where xy# is the initial text, and the parsing process is finished if the parser goes into the final state (ε, ε, v). If the text is ay# in an intermediate step and the symbol X is at the top of the stack, then the possible state-transitions are as follows:

(ay#, Aβ#, v) → (ay#, αβ#, vi), if T[A, a] = (α, i),
(ay#, aβ#, v) → (y#, β#, v), if T[a, a] = pop,
(#, #, v) → (ε, ε, v), O.K., if T[#, #] = accept,
and the transition leads to ERROR in every other case.

The letters O.K. mean that the analysed text is syntactically correct; the text ERROR means that a syntactic error was detected.

The actions of this parser are written by the next algorithm.

LL(1)-Parser(xy#, T)

  1  s ← analyse, (z, w, v) ← (xy#, S#, ε)
  2  REPEAT
  3    IF the symbol at the top of the stack is A ∈ N and T[A, a] = (α, i)
  4      THEN (z, Aβ#, v) → (z, αβ#, vi)
  5    ELSE IF the symbol at the top of the stack is the actual symbol a
  6      THEN (ay#, aβ#, v) → (y#, β#, v)
  7    ELSE IF (z, w) = (#, #)
  8      THEN s ← O.K.
  9    ELSE s ← ERROR
 10  UNTIL s = O.K. OR s = ERROR
 11  RETURN s, (z, w, v)



The input parameters of this algorithm are the text xy# and the parsing table T. The variable s describes the state of the parser: its value is analyse during the analysis, and it is O.K. or ERROR at the end. The parser determines its action from the actual symbol and the symbol at the top of the stack, using the parsing table T. In lines 3–4 the parser builds the syntax tree using the derivation rule A → α. In lines 5–6 the parser executes a pop action, since there is a terminal symbol at the top of the stack. In lines 7–9 the algorithm finishes its work: if the stack is empty and we are at the end of the text then the result is O.K., otherwise a syntactic error was detected. At the end the result is O.K. or ERROR in the variable s, and the triple (z, w, v) is at the output of the algorithm. If the text was correct, then from the third element of the triple we can create the syntax tree of the analysed text. If there was an error, then the first element of the triple points to the position of the erroneous symbol.
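The parser loop can be sketched in Python for the expression grammar (an illustration added here; the rule numbering and the hand-filled table are assumptions consistent with this section). The state is the triple (remaining input, stack, rule list):

```python
RULES = {1: ('E', ['T', "E'"]), 2: ("E'", ['+', 'T', "E'"]), 3: ("E'", []),
         4: ('T', ['F', "T'"]), 5: ("T'", ['*', 'F', "T'"]), 6: ("T'", []),
         7: ('F', ['(', 'E', ')']), 8: ('F', ['i'])}

TABLE = {('E', '('): 1, ('E', 'i'): 1,
         ("E'", '+'): 2, ("E'", ')'): 3, ("E'", '#'): 3,
         ('T', '('): 4, ('T', 'i'): 4,
         ("T'", '*'): 5, ("T'", '+'): 6, ("T'", ')'): 6, ("T'", '#'): 6,
         ('F', '('): 7, ('F', 'i'): 8}

def ll1_parse(text):
    """Return ('O.K.', rule list) or ('ERROR', error position)."""
    input_ = list(text) + ['#']
    stack = ['#', 'E']               # bottom mark, then the start symbol
    derivation = []
    pos = 0
    while True:
        x, a = stack[-1], input_[pos]
        if x == '#' and a == '#':    # empty stack at the end of the text
            return ('O.K.', derivation)
        if (x, a) in TABLE:          # expand the nonterminal by rule T[x, a]
            i = TABLE[(x, a)]
            stack.pop()
            stack.extend(reversed(RULES[i][1]))
            derivation.append(i)
        elif x == a:                 # pop: matching terminal symbols
            stack.pop()
            pos += 1
        else:
            return ('ERROR', pos)
```

For example, `ll1_parse('i+i')` accepts and returns the rule list of the leftmost derivation, while `ll1_parse('i+)')` reports the position of the erroneous symbol.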

Example 2.8 Let G = ({E, E', T, T', F}, {+, *, (, ), i}, P, E) be a grammar whose derivation rules are:

E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | i.

From these rules we can determine the First and Follow sets. To fill in the parsing table, the following sets are required:

First(TE') = {(, i},

First(+TE') = {+},

First(FT') = {(, i},

First(*FT') = {*},

First((E)) = {(},

First(i) = {i},

Follow(E') = {), #},

Follow(T') = {+, ), #}.

The parsing table is as follows. The empty positions in the table mean errors.

Example 2.9 Using the parsing table of the previous example, analyse the text .

The syntax tree of the analysed text is shown in Figure 2.11.

#### 2.3.1.3. Recursive-descent parsing method

There is another frequently used method for backtrack-free top-down parsing. Its essence is that we write a real program for the applied grammar. We create a procedure for each symbol of the grammar, and the recursive calls of these procedures realise the stack of the parser and the stack management. This is a top-down parsing method, and since the procedures call each other recursively, the method is called the recursive-descent method.

To check the terminal symbols we create the procedure Check. Let the parameter of this procedure be the expected symbol, that is, the leftmost unchecked terminal symbol of the sentential form, and let the actual symbol be the symbol which is analysed at that moment.

       procedure Check(a);
       begin
          if actual_symbol = a
             then Next_symbol
             else Error_report
       end;

The procedure Next_symbol reads the next symbol: it is a call to the lexical analyser. This procedure determines the next symbol and puts it into the variable actual_symbol. The procedure Error_report creates an error report and then finishes the parsing.

We create procedures for the symbols of the grammar as follows. The procedure of the nonterminal symbol A is the next.

       procedure A;
       begin
          T(A)
       end;

where T(A) is determined by the symbols on the right side of the derivation rule having the symbol A on its left side.

The grammars which are used for syntactic analysis are reduced grammars. This means that there are no unnecessary symbols in the grammar, and every nonterminal occurs on the left side of at least one derivation rule. Therefore, if we consider the symbol A, there is at least one derivation rule A → α.

1. If there is only one derivation rule A → α for the symbol A:

   1. for the rule A → a (a ∈ T), let the program of the rule be the call Check(a),

   2. for the rule A → B (B ∈ N), we give the procedure call B,

   3. for the rule A → X_1 X_2 … X_n, we give the next block:

   begin
      T(X_1);
      T(X_2);
      ...
      T(X_n)
   end;

2. If there are more derivation rules for the symbol A:

   1. If the rules A → α_1 | α_2 | … | α_n are ε-free, that is, from the α_i (1 ≤ i ≤ n) it is not possible to derive ε, then T(A) is

   case actual_symbol of
      First(alpha_1) : T(alpha_1);
      First(alpha_2) : T(alpha_2);
      ...
      First(alpha_n) : T(alpha_n)
   end;

   where First(alpha_i) is the sign of the set First(α_i).

We note that this is the first point of the method of the recursive-descent parser where we use the fact that the grammar is an LL(1) grammar.

2. We use the grammar to describe a programming language, therefore it is not comfortable to require that the grammar is an ε-free grammar. For the rules A → α_1 | α_2 | … | α_(n-1) | ε we create the next T(A) program:

case actual_symbol of

 First(alpha_1) : T(alpha_1);

 First(alpha_2) : T(alpha_2);

 ...

 First(alpha_(n-1)) : T(alpha_(n-1));

 Follow(A) : skip

end;

where Follow(A) is the sign of the set Follow(A).

In particular, if α_i = ε for some i (1 ≤ i ≤ n), then the i-th row of the case statement is

Follow(A) : skip

In the program T(A), if it is possible, we use if-then-else or while statements instead of the case statement.

The start procedure, that is, the main program of this parsing program, is the procedure created for the start symbol of the grammar.

We can create the recursive-descent parsing program with the next algorithm. The input of this algorithm is the grammar G, and the result is the parsing program P. In this algorithm we use a WRITE-PROGRAM procedure, which concatenates the new program lines to the program P. We will not go into the details of this algorithm.

Create-Rec-Desc(G)

  1  P ← empty program
  2  WRITE-PROGRAM(
  3     procedure Check(a);
  4     begin
  5        if actual_symbol = a
  6           then Next_symbol
  7           else Error_report
  8     end;
  9  )
 10  FOR all symbols A ∈ N of the grammar G
 11    DO IF A = S
 12      THEN WRITE-PROGRAM(
 13         program S;
 14         begin
 15            REC-DESC-STAT(S)
 16         end;
 17      )
 18      ELSE WRITE-PROGRAM(
 19         procedure A;
 20         begin
 21            REC-DESC-STAT(A)
 22         end;
 23      )
 24  RETURN P



The algorithm creates the procedure Check in lines 2–9. Then, for all nonterminals of the grammar G, it determines their procedures using the algorithm REC-DESC-STAT. In lines 11–17 we can see that for the start symbol S we create the main program. The output of the algorithm is the parsing program P.

Rec-Desc-Stat(A)

  1  IF there is only one rule A → α
  2    THEN REC-DESC-STAT1(α)
  3    ELSE REC-DESC-STAT2(A, (α_1, …, α_n))
  4  RETURN P



The form of the statements of the parsing program depends on the derivation rules of the symbol A. Therefore the algorithm REC-DESC-STAT divides the task into two parts. The algorithm REC-DESC-STAT1 deals with the case when there is only one derivation rule, and the algorithm REC-DESC-STAT2 creates the program for the alternatives.

Rec-Desc-Stat1(α)

  1  IF α = a, where a ∈ T
  2    THEN WRITE-PROGRAM(
  3       Check(a)
  4    )
  5  IF α = B, where B ∈ N
  6    THEN WRITE-PROGRAM(
  7       B
  8    )
  9  IF α = X_1 X_2 … X_n (n ≥ 2)
 10    THEN WRITE-PROGRAM(
 11       begin
 12          REC-DESC-STAT1(X_1);
 13          REC-DESC-STAT1(X_2);
 14          ...
 15          REC-DESC-STAT1(X_n)
 16       end; )
 17  RETURN P



Rec-Desc-Stat2(A, (α_1, …, α_n))

  1  IF the rules A → α_1 | α_2 | … | α_n are ε-free
  2    THEN WRITE-PROGRAM(
  3       case actual_symbol of
  4          First(alpha_1) : REC-DESC-STAT1(α_1);
  5          ...
  6          First(alpha_n) : REC-DESC-STAT1(α_n)
  7       end;
  8    )
  9  IF there is an ε-rule, that is, α_i = ε for some i (1 ≤ i ≤ n)
 10    THEN WRITE-PROGRAM(
 11       case actual_symbol of
 12          First(alpha_1) : REC-DESC-STAT1(α_1);
 13          ...
 14          First(alpha_(i-1)) : REC-DESC-STAT1(α_(i-1));
 15          Follow(A) : skip;
 16          First(alpha_(i+1)) : REC-DESC-STAT1(α_(i+1));
 17          ...
 18          First(alpha_n) : REC-DESC-STAT1(α_n)
 19       end;
 20    )
 21  RETURN P



These two algorithms create the program described above.

Checking the end of the parsed text is achieved by the recursive-descent parsing method with the next modification. We generate a new derivation rule for the end mark #. If the start symbol of the grammar is S, then we create the new rule S' → S#, where the new symbol S' is the start symbol of our new grammar. The mark # is considered a terminal symbol. Then we generate the parsing program for this new grammar.

Example 2.10 We augment the grammar of Example 2.8 in the above manner. The derivation rules are as follows:

S' → E#, E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | i.

In Example 2.8 we gave the necessary First and Follow sets. Here we use the next sets:

First(+TE') = {+},

First(*FT') = {*},

First((E)) = {(},

First(i) = {i},

Follow(E') = {), #},

Follow(T') = {+, ), #}.

In the comments of the program lines we give the sets that were used. The first characters of a comment are the character pair --.

The program of the recursive-descent parser is the following.

       program S';
       begin
          E;
          Check(#)
       end.

       procedure E;
       begin
          T;
          E'
       end;

       procedure E';
       begin
          case actual_symbol of
          + : begin -- First(+TE')
                 Check(+);
                 T;
                 E'
              end;
          ),# : skip -- Follow(E')
          end
       end;

       procedure T;
       begin
          F;
          T'
       end;

       procedure T';
       begin
          case actual_symbol of
          * : begin -- First(*FT')
                 Check(*);
                 F;
                 T'
              end;
          +,),# : skip -- Follow(T')
          end
       end;

       procedure F;
       begin
          case actual_symbol of
          ( : begin -- First((E))
                 Check(();
                 E;
                 Check())
              end;
          i : Check(i) -- First(i)
          end
       end;

We can see that the main program of this parser belongs to the symbol S'.
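The same parser can be sketched in Python (an illustration added here, not part of the text): each nonterminal becomes a function, and the case statements become if/elif branches on the actual symbol.

```python
class ParseError(Exception):
    pass

def parse(text):
    symbols = list(text) + ['#']
    pos = 0

    def actual():
        return symbols[pos]

    def check(a):                      # the Check procedure
        nonlocal pos
        if actual() == a:
            pos += 1                   # Next_symbol
        else:
            raise ParseError(pos)      # Error_report

    def E():
        T(); E2()

    def E2():                          # procedure E'
        if actual() == '+':            # First(+TE')
            check('+'); T(); E2()
        elif actual() in (')', '#'):   # Follow(E') : skip
            pass
        else:
            raise ParseError(pos)

    def T():
        F(); T2()

    def T2():                          # procedure T'
        if actual() == '*':            # First(*FT')
            check('*'); F(); T2()
        elif actual() in ('+', ')', '#'):  # Follow(T') : skip
            pass
        else:
            raise ParseError(pos)

    def F():
        if actual() == '(':            # First((E))
            check('('); E(); check(')')
        elif actual() == 'i':          # First(i)
            check('i')
        else:
            raise ParseError(pos)

    E(); check('#')                    # the main program S'
    return True
```

A syntactically correct text such as `i+i*i` is accepted, while an erroneous text raises a ParseError carrying the position of the erroneous symbol.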

### 2.3.2 LR(1) parsing

If we analyse bottom-up, then we start with the program text. We search for the handle of the sentential form, and we substitute for this handle the nonterminal symbol that belongs to it. After this first step, we repeat this procedure several times. Our goal is to reach the start symbol of the grammar. This symbol will be the root of the syntax tree, and by this time the terminal symbols of the program text are the leaves of the tree.

First we review the notions which are necessary for this parsing.

For bottom-up analysis, we have to determine the handle of the sentential form. The problem is to create a good method which finds the handle, and to find the best substitution if there is more than one possibility.

Definition 2.12 If A → β ∈ P, then the rightmost substitution of the sentential form αAx (x ∈ T*, α ∈ (N ∪ T)*) is αβx, that is, αAx ⇒_rightmost αβx.

Definition 2.13 If in the derivation S ⇒* x (x ∈ T*) all of the substitutions were rightmost, then this derivation is a rightmost derivation, S ⇒*_rightmost x.

In a rightmost derivation, terminal symbols are at the right side of the sentential form. By the connection of the notion of the handle and the rightmost derivation, if we apply the steps of a rightmost derivation backwards, then we obtain the steps of a bottom-up parsing. Hence bottom-up parsing is equivalent to the "inverse" of a rightmost derivation. Therefore, when we deal with bottom-up methods, we will not write the word "rightmost" at the arrows.

General bottom-up parsing methods are realised by using backtrack algorithms. They are similar to the top-down parsing methods, but the backtrack steps make the parser very slow. Therefore we deal only with grammars that have parsing methods without backtracks.

Henceforth we present a very efficient algorithm for a large class of context-free grammars. This class contains the grammars of the programming languages.

The parsing is called LR(k) parsing; the grammar is called an LR(k) grammar. LR means that the text is analysed "Left to Right", and k means that by looking ahead k symbols we can determine the handles of the sentential forms. The LR(k) parsing method is a shift-reduce method.

We deal with LR(1) parsing only, since for every LR(k) grammar there is an equivalent LR(1) grammar. This fact is very important for us, since using this type of grammars it is enough to look ahead one symbol in all cases.

Creating LR(1) parsers is not an easy task. However, there are standard programs (for example yacc in UNIX systems) that create the complete parsing program from the derivation rules of a grammar. Using these programs the task of writing parsers is not too hard.

After studying LR(k) grammars we will deal with the LR(1) parsing method. This method is used in the compilers of modern programming languages.

#### 2.3.2.1. LR(k) grammars

As we did previously, we write the mark # to the right end of the text to be analysed. We introduce a new nonterminal symbol S' and a new rule S' → S into the grammar.

Definition 2.14 Let G' = (N ∪ {S'}, T, P ∪ {S' → S}, S') be the augmented grammar belonging to the grammar G = (N, T, P, S).

We assign serial numbers to the derivation rules of the grammar, and let S' → S be the 0th rule. Using this numbering, if we apply the 0th rule, it means that the parsing process is concluded and the text is correct.

We notice that if the original start symbol does not occur on the right side of any rule, then there is no need for this augmentation. However, for the sake of generality, we deal with augmented grammars only.

Definition 2.15 The augmented grammar G' is an LR(k) grammar (k ≥ 0), if for the derivations

S' ⇒*_rightmost αAw ⇒ αβw,
S' ⇒*_rightmost γBx ⇒ αβy

(A, B ∈ N, w, x, y ∈ T*, α, β, γ ∈ (N ∪ T)*) the equality

First_k(w) = First_k(y)

implies

α = γ, A = B and x = y.

The feature of LR(k) grammars is that, in the sentential form αβw, looking ahead k symbols from w unambiguously decides whether β is or is not the handle. If the handle is β, then we have to reduce the form using the rule A → β, which results in the new sentential form αAw. The reason is the following: suppose that, for the sentential forms αβw and αβy (their prefixes αβ are the same), First_k(w) = First_k(y), and we can reduce αβw to αAw and αβy to γBy. In this case, since the grammar is an LR(k) grammar, α = γ and A = B hold. Therefore either β is the handle of both sentential forms, or it is the handle of neither of them.

Example 2.11 Let G be a grammar and let its derivation rules be as follows.

This grammar is not an LR(0) grammar, since, using the notations of the definition, there are derivations for which the equality of the looked-ahead sets holds but the conclusion of the definition does not.

Example 2.12 The next grammar is an LR(1) grammar. Let G be a grammar whose derivation rules are:

.

In the next example we show that there is a context-free grammar which is not an LR(k) grammar for any k (k ≥ 0).

Example 2.13 Let G be a grammar and let its derivation rules be

Now for all

and

but

It is not certain that for an LL(k) grammar one can find an equivalent LL(1) grammar. However, LR(k) grammars have this nice property.

Theorem 2.16 For every LR(k) grammar there is an equivalent LR(1) grammar.

The great significance of this theorem is that it makes it sufficient to study LR(1) grammars instead of LR(k) grammars.

#### 2.3.2.2. LR(1) canonical sets

Now we define a very important notion of LR parsing.

Definition 2.17 If β is the handle of the sentential form αβy (α, β ∈ (N ∪ T)*, y ∈ T*), then the prefixes of αβ are the viable prefixes of αβy.

Example 2.14 Let G be a grammar and its derivation rules as follows.

is a sentential form, and the first is the handle. The viable prefixes of this sentential form are .

By the above definition, symbols after the handle are not parts of any viable prefix. Hence the task of finding the handle is the task of finding the longest viable prefix.

For a given grammar, the set of viable prefixes is determined, but it is obvious that the size of this set is not always finite.

The significance of viable prefixes is the following. We can assign the states of a deterministic finite automaton to the viable prefixes, and we can assign state transitions to the symbols of the grammar. From the initial state we go to a state along the symbols of a viable prefix. Using this property, we will give a method to create an automaton that executes the task of parsing.

Definition 2.18 If A → αβ is a rule of the grammar G', then let

[A → α.β, a] (a ∈ T ∪ {#})

be an LR(1)-item, where A → α.β is the core of the LR(1)-item, and a is the lookahead symbol of the LR(1)-item.

The lookahead symbol is instrumental only in reductions, i.e. when the item has the form [A → α., a]. It means that we can execute a reduction only if the symbol a follows the handle α.

Definition 2.19 The LR(1)-item [A → α.β, a] is valid for the viable prefix γα if

S' ⇒*_rightmost γAx ⇒ γαβx (γ ∈ (N ∪ T)*, x ∈ T*)

and a is the first symbol of x, or a = # if x = ε.

Example 2.15 Let G' be a grammar and its derivation rules as follows.

Using these rules, we can derive . Here is a viable prefix, and is valid for this viable prefix. Similarly, , and -item is valid for viable prefix .

When creating an LR(1) parser, we construct the canonical sets of LR(1)-items. To achieve this we have to define the functions closure and read.

Definition 2.20 Let the set I be a set of LR(1)-items for a given grammar. The set closure(I) consists of the next LR(1)-items:

1. every element of the set I is an element of the set closure(I),

2. if [A → α.Bβ, a] ∈ closure(I), and B → γ is a derivation rule of the grammar, then [B → .γ, b] ∈ closure(I) for all b ∈ First(βa),

3. the set closure(I) is expanded using step 2 until no more items can be added to it.

By the definitions, if the LR(1)-item [A → α.Bβ, a] is valid for the viable prefix γα, then the LR(1)-item [B → .γ, b] is valid for the same viable prefix in the case of b ∈ First(βa) (Figure 2.14). It is obvious that the function closure creates all LR(1)-items which are valid for the same viable prefix.

We can define the function , i.e. the closure of set by the following algorithm. The result of this algorithm is the set .

Closure-Set-of-Items(I)

 1  I′ ← ∅
 2  FOR all LR(1)-items e ∈ I
 3    DO I′ ← I′ ∪ Closure-Item(e)
 4  RETURN I′



Closure-Item(e)

 1  I ← {e}
 2  IF the LR(1)-item e has form [A → α.Bβ, a]
 3    THEN J ← I
 4         K ← ∅
 5         REPEAT
 6           FOR all items of J which have form [C → γ₁.Dγ₂, c]
 7             DO FOR all rules D → δ
 8                  DO FOR all symbols b ∈ First(γ₂c)
 9                       DO K ← K ∪ {[D → .δ, b]}
10           J ← K \ I
11           IF J ≠ ∅
12             THEN I ← I ∪ J
13                  K ← ∅
14         UNTIL J = ∅
15  RETURN I



The algorithm CLOSURE-ITEM creates I, the closure of the item e. If in the item e the "point" is followed by a terminal symbol, then the result is this one item only (line 1). If in e the "point" is followed by a nonterminal symbol, then we create new items from every rule having this symbol on its left side (line 9). We have to check this condition for all new items too; this is the task of the REPEAT cycle in lines 5–14. These steps are executed until no more items can be added (line 14). The set J contains the items to be checked, the set K contains the newly created items; the really new items are selected by the operation in line 10.
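The closure computation can also be sketched in Python. The grammar of Example 2.15 did not survive in this copy, so the sketch uses the classic grammar S′ → S, S → CC, C → cC | d as a stand-in (an assumption); an LR(1)-item is a triple (rule index, position of the point, lookahead symbol), and "$" plays the role of the end marker.

```python
# Assumed toy grammar; rule 0 is the augmenting rule S' -> S.
RULES = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {lhs for lhs, _ in RULES}

def first(sym):
    # FIRST set of a symbol; this toy grammar has no epsilon-rules,
    # so FIRST of a string equals FIRST of its first symbol.
    if sym not in NONTERMS:
        return {sym}
    out = set()
    for lhs, rhs in RULES:
        if lhs == sym and rhs[0] != sym:   # guard against direct left recursion
            out |= first(rhs[0])
    return out

def closure(items):
    # Step 2 of Definition 2.20: for an item with the point before a
    # nonterminal B, add [B -> .gamma, b] for every rule B -> gamma and
    # every b in First(beta a); repeat until nothing new can be added.
    items, work = set(items), list(items)
    while work:
        r, dot, a = work.pop()
        rhs = RULES[r][1]
        if dot < len(rhs) and rhs[dot] in NONTERMS:
            beta = rhs[dot + 1:]
            lookaheads = first(beta[0]) if beta else {a}
            for j, (lhs, _) in enumerate(RULES):
                if lhs == rhs[dot]:
                    for b in lookaheads:
                        if (j, 0, b) not in items:
                            items.add((j, 0, b))
                            work.append((j, 0, b))
    return frozenset(items)

# closure of the start item [S' -> .S, $]
I0 = closure({(0, 0, "$")})
```

For this grammar the closure of the start item contains six items: the two kernel-like items [S′ → .S, $] and [S → .CC, $], plus the rules of C with lookaheads c and d.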

Definition 2.21 Let I be a set of LR(1)-items for the grammar G, and let X be a symbol of the grammar. Then the set read(I, X) consists of the following LR(1)-items:

1. if [A → α.Xβ, a] ∈ I, then every element of the set closure({[A → αX.β, a]}) is in read(I, X),

2. the set read(I, X) is extended using step 1 until no more items can be added to it.

The function read "reads the symbol X" in the items of I; by this operation the sign "point" in these items gets to the other side of X. If the set I contains the valid LR(1)-items for a viable prefix γ, then the set read(I, X) contains the valid LR(1)-items for the viable prefix γX.

The algorithm READ-SET executes the function read. The result is the set read(I, X).

Read-Set(I, X)

 1  I′ ← ∅
 2  FOR all e ∈ I
 3    DO I′ ← I′ ∪ Read-Item(e, X)
 4  RETURN I′



Read-Item(e, X)

 1  IF e = [A → α.Yβ, a] and Y = X
 2    THEN I ← Closure-Item([A → αX.β, a])
 3    ELSE I ← ∅
 4  RETURN I



Using these algorithms we can create the set of all items which describes the state reached after reading the symbol X.
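Under the same assumptions as before (the toy grammar S′ → S, S → CC, C → cC | d standing in for the lost grammar of Example 2.15, items as triples), the function read can be sketched as a point-move followed by a closure; the fragment repeats the closure machinery so that it runs on its own.

```python
RULES = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {lhs for lhs, _ in RULES}

def first(sym):
    # FIRST set; the toy grammar has no epsilon-rules.
    if sym not in NONTERMS:
        return {sym}
    return {t for lhs, rhs in RULES if lhs == sym and rhs[0] != sym
            for t in first(rhs[0])}

def closure(items):
    items, work = set(items), list(items)
    while work:
        r, d, a = work.pop()
        rhs = RULES[r][1]
        if d < len(rhs) and rhs[d] in NONTERMS:
            las = first(rhs[d + 1]) if d + 1 < len(rhs) else {a}
            for j, (lhs, _) in enumerate(RULES):
                if lhs == rhs[d]:
                    for b in las:
                        if (j, 0, b) not in items:
                            items.add((j, 0, b)); work.append((j, 0, b))
    return frozenset(items)

def read(items, X):
    # Definition 2.21: move the point over X in every item where X follows
    # the point, then close the resulting set.
    moved = {(r, d + 1, a) for (r, d, a) in items
             if d < len(RULES[r][1]) and RULES[r][1][d] == X}
    return closure(moved)

I0 = closure({(0, 0, "$")})
I_C = read(I0, "C")       # the state reached after reading the nonterminal C
```

Reading C from the start set moves the point in [S → .CC, $] and then closes the result, giving [S → C.C, $], [C → .cC, $] and [C → .d, $].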

Now we introduce the following notation for LR(1)-items, to give shorter descriptions. Let

be a notation for items

Example 2.16 The LR(1)-item is an item of the grammar in Example 2.15. For this item

We can create the canonical sets of LR(1)-items, or shortly the LR(1)-canonical sets, with the following method.

Definition 2.22 The canonical sets of LR(1)-items are created as follows.

The first canonical set is closure({[S′ → .S, #]}).

Create the set read(H, X) for an already constructed canonical set H and a symbol X. If this set is not empty and it is not equal to a previously constructed canonical set, then it is the next canonical set.

Repeat this operation for all previously generated canonical sets and for all terminal and nonterminal symbols X. If we get a nonempty set which is not equal to any of the previous sets, then this set is a new canonical set, and its index is greater by one than the maximal index of the previously generated canonical sets. The procedure stops when no more new sets can be created.

The sets created in this way are the canonical sets of LR(1)-items of the grammar G.

The number of possible LR(1)-items of a grammar is finite, hence the above method terminates in finite time.

The next algorithm creates the canonical sets of the grammar G.

Create-Canonical-Sets(G)

 1  i ← 0
 2  Hi ← Closure-Set-of-Items({[S′ → .S, #]})
 3  I ← {Hi}
 4  REPEAT
 5    L ← I
 6    FOR all Hj ∈ I
 7      DO FOR all symbols X ∈ T ∪ N
 8           DO
 9             J ← Closure-Set-of-Items(Read-Set(Hj, X))
10             IF J ≠ ∅ and J ∉ I
11               THEN i ← i + 1
12                    Hi ← J
13                    I ← I ∪ {Hi}
14  UNTIL I = L
15  RETURN I



The result of the algorithm is the collection of canonical sets. The first canonical set is created in line 2. Further canonical sets are created by the functions CLOSURE-SET-OF-ITEMS(READ-SET()) in line 9. The test in line 10 checks that the new set differs from all previous sets; if so, this set becomes a new canonical set (lines 11–13). The FOR cycle guarantees that these operations are executed for all previously generated sets, and the REPEAT cycle generates new canonical sets as long as possible.
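The whole construction of Definition 2.22 can be sketched as a worklist algorithm. As before, the grammar of Example 2.15 is not available here, so the sketch uses the assumed toy grammar S′ → S, S → CC, C → cC | d.

```python
# Canonical sets of LR(1)-items for an assumed toy grammar.
RULES = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {lhs for lhs, _ in RULES}
SYMBOLS = {x for _, rhs in RULES for x in rhs}

def first(sym):
    if sym not in NONTERMS:
        return {sym}
    return {t for lhs, rhs in RULES if lhs == sym and rhs[0] != sym
            for t in first(rhs[0])}

def closure(items):
    items, work = set(items), list(items)
    while work:
        r, d, a = work.pop()
        rhs = RULES[r][1]
        if d < len(rhs) and rhs[d] in NONTERMS:
            las = first(rhs[d + 1]) if d + 1 < len(rhs) else {a}
            for j, (lhs, _) in enumerate(RULES):
                if lhs == rhs[d]:
                    for b in las:
                        if (j, 0, b) not in items:
                            items.add((j, 0, b)); work.append((j, 0, b))
    return frozenset(items)

def read(items, X):
    return closure({(r, d + 1, a) for (r, d, a) in items
                    if d < len(RULES[r][1]) and RULES[r][1][d] == X})

def canonical_sets():
    # Start from closure({[S' -> .S, $]}) and apply read for every symbol
    # of every known set until no new nonempty set appears.
    H0 = closure({(0, 0, "$")})
    sets, work = [H0], [H0]
    while work:
        I = work.pop()
        for X in sorted(SYMBOLS):
            J = read(I, X)
            if J and J not in sets:
                sets.append(J); work.append(J)
    return sets

CANON = canonical_sets()
```

For this grammar the construction yields ten canonical sets, which is the size of the LR(1) automaton.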

Example 2.17 The canonical sets of LR(1)-items for the grammar of Example 2.15 are as follows.

The automaton of the parser is in Figure 2.15.

#### 2.3.2.3. LR(1) parser.

If the canonical sets of LR(1)-items

were created, then we assign a state of an automaton to every canonical set. The relation between the states of the automaton and the canonical sets of LR(1)-items is stated by the next theorem, the "great" theorem of LR(1)-parsing.

Theorem 2.23 The set of the LR(1)-items valid for a viable prefix γ can be assigned to an automaton-state k such that there is a path labelled by γ from the initial state to state k.

This theorem states that we can create the automaton of the parser using the canonical sets. Now we give a method to create this parser from the canonical sets of LR(1)-items.

The deterministic finite automaton can be described with a table, which is called the parsing table. The rows of the table are assigned to the states of the automaton.

The parsing table has two parts. The first part is the action table. Since the operations of the parser are determined by the symbols of the analysed text, the action table is divided into columns labelled by the terminal symbols. The action table contains the information about the action to be performed at the given state for the given symbol. These actions can be shifts or reductions. The sign of a shift operation is sj, where j is the next state. The sign of a reduction is ri, where i is the serial number of the applied rule. The reduction by the rule having serial number zero means the termination of the parsing and that the parsed text is syntactically correct; for this reason we call this operation accept.

The second part of the parsing table is the goto table. This table contains the information about the shifts caused by nonterminal symbols. (Shifts belonging to terminals are in the action table.)

Let 0, 1, …, m be the states of the automaton. The k-th row of the tables is filled in from the LR(1)-items of the k-th canonical set.

The k-th row of the action table:

• if [A → α.aβ, b] ∈ Hk and read(Hk, a) = Hj, then action[k, a] = sj,

• if [A → α., a] ∈ Hk and A ≠ S′, then action[k, a] = ri, where A → α is the i-th rule of the grammar,

• if [S′ → S., #] ∈ Hk, then action[k, #] = accept.

The method of filling in the goto table:

• if read(Hk, A) = Hj, where A is a nonterminal symbol, then goto[k, A] = j.

In both tables we write the text error into the positions which remain empty.

These action and goto tables are called canonical parsing tables.

Theorem 2.24 The augmented grammar G′ is an LR(1) grammar if and only if we can fill in the parsing tables created for this grammar without conflicts.

We can fill in the parsing tables with the next algorithm.

Fill-in-LR(1)-Table(G)

 1  FOR all LR(1) canonical sets Hk
 2    DO FOR all LR(1)-items of Hk
 3         DO IF [A → α.aβ, b] ∈ Hk and read(Hk, a) = Hj
 4              THEN action[k, a] ← sj
 5            IF [A → α., a] ∈ Hk, A ≠ S′ and A → α is the i-th rule
 6              THEN action[k, a] ← ri
 7            IF [S′ → S., #] ∈ Hk
 8              THEN action[k, #] ← accept
 9       IF read(Hk, B) = Hj and B ∈ N
10         THEN goto[k, B] ← j
11  FOR all states k and all symbols a ∈ T ∪ {#}
12    DO IF action[k, a] = "empty"
13         THEN action[k, a] ← error
14  FOR all states k and all symbols B ∈ N
15    DO IF goto[k, B] = "empty"
16         THEN goto[k, B] ← error
17  RETURN action, goto


We fill in the tables row by row. In lines 3–8 of the algorithm we fill in the action table, in lines 9–10 we fill in the goto table. In lines 11–16 we write error into the positions which remained empty.
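The table-filling step can be sketched in Python on the assumed toy grammar S′ → S, S → CC, C → cC | d (the grammar of Example 2.15 is not available in this copy). A small put helper makes conflicts visible: filling fails with an assertion error exactly in the situation of Theorem 2.24, when the grammar is not an LR(1) grammar.

```python
RULES = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {lhs for lhs, _ in RULES}
SYMBOLS = {x for _, rhs in RULES for x in rhs}

def first(sym):
    if sym not in NONTERMS:
        return {sym}
    return {t for lhs, rhs in RULES if lhs == sym and rhs[0] != sym
            for t in first(rhs[0])}

def closure(items):
    items, work = set(items), list(items)
    while work:
        r, d, a = work.pop()
        rhs = RULES[r][1]
        if d < len(rhs) and rhs[d] in NONTERMS:
            las = first(rhs[d + 1]) if d + 1 < len(rhs) else {a}
            for j, (lhs, _) in enumerate(RULES):
                if lhs == rhs[d]:
                    for b in las:
                        if (j, 0, b) not in items:
                            items.add((j, 0, b)); work.append((j, 0, b))
    return frozenset(items)

def read(items, X):
    return closure({(r, d + 1, a) for (r, d, a) in items
                    if d < len(RULES[r][1]) and RULES[r][1][d] == X})

def canonical_sets():
    H0 = closure({(0, 0, "$")})
    sets, work = [H0], [H0]
    while work:
        I = work.pop()
        for X in sorted(SYMBOLS):
            J = read(I, X)
            if J and J not in sets:
                sets.append(J); work.append(J)
    return sets

def fill_in_tables(sets):
    # Fill action and goto row by row; missing positions stand for error.
    action, goto = {}, {}
    def put(table, key, val):
        # a different value at an occupied position would be a conflict
        assert table.get(key, val) == val, f"conflict at {key}"
        table[key] = val
    for k, I in enumerate(sets):
        for (r, d, a) in I:
            rhs = RULES[r][1]
            if d < len(rhs):                       # shift or goto entry
                X = rhs[d]
                j = sets.index(read(I, X))
                if X in NONTERMS:
                    put(goto, (k, X), j)
                else:
                    put(action, (k, X), ("shift", j))
            elif r == 0:                           # [S' -> S., $]
                put(action, (k, a), ("accept",))
            else:                                  # reduce by rule r
                put(action, (k, a), ("reduce", r))
    return action, goto

CANON = canonical_sets()
ACTION, GOTO = fill_in_tables(CANON)
```

For this grammar the filling is conflict-free, so it is an LR(1) grammar.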

Now we deal with the steps of the parsing. (Figure 2.16).

The state of the parsing is described by configurations. A configuration of the parser consists of two parts: the first is the stack and the second is the unexpended input text.

The stack of the parsing is a double stack: we write or read two data with every push or pop operation. The stack consists of pairs; the first element of a pair is a terminal or nonterminal symbol, and the second element is the serial number of a state of the automaton. The initial content of the stack is the pair (#, 0).

The start configuration consists of the initial stack and the whole unexpended input text.

The parsing is successful if the parser reaches the final state. In the final state the parser is at the end of the text.

Suppose that the parser is in some configuration. The next move of the parser is determined by the entry of the action table belonging to the current state and the actual input symbol.

The state transitions are the following.

• If the entry is a shift sj, then the parser executes a shift: the actual symbol and the new state j are written into the stack, and the actual symbol is deleted from the input.

• If the entry is a reduction ri, then we execute a reduction by the i-th rule A → α. In this step we delete |α| pairs from the stack, and then we determine the new state using the goto table: if after the deletion the state h is at the top of the stack, then the new state is goto[h, A].

• If the entry is accept, then the parsing is completed, and the analysed text was correct.

• If the entry is error, then the parsing terminates, and a syntactic error was discovered at the actual symbol.

The LR(1) parser is often named the canonical LR(1) parser.

Denote the action and goto tables together by T. We can give the following algorithm for the steps of the LR(1) parser.

LR(1)-Parser(xay#, T)

 1  s ← parsing, PUSH(#, 0)
 2  REPEAT
 3    (the current configuration is (#h₀X₁h₁ … Xmhm, ay#))
 4    IF T[hm, a] = sj
 5      THEN PUSH(a, j)
 6      ELSE IF T[hm, a] = ri and A → α is the i-th rule and |α| = r
 7             THEN delete r pairs from the top of the stack
 8                  PUSH(A, T[h, A]), where h is the new top state
 9      ELSE IF T[hm, a] = accept
10             THEN s ← O.K.
11      ELSE s ← ERROR
12  UNTIL s = O.K. OR s = ERROR
13  RETURN s, (stack content, remaining text)



The input parameters of the algorithm are the text and the table T. The variable s indicates the action of the parser: it has value parsing in the intermediate states, and its value is O.K. or ERROR in the final states. In line 3 we detail the configuration of the parser, which is necessary in lines 6–8. Using the action table, the parser determines its move from the state at the top of the stack and from the actual symbol. In lines 4–5 we execute a shift step, in lines 6–8 a reduction. The algorithm is completed in lines 9–11: if the parser has found an accept action, then the text is correct, otherwise a syntax error was detected. According to this, the output of the algorithm is O.K. or ERROR, and the final configuration is part of the output, too. In the case of an error, the first symbol of the second component of the configuration is the erroneous symbol.
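The driver loop above can be sketched end to end in Python, again on the assumed toy grammar S′ → S, S → CC, C → cC | d (the grammar and tables of Examples 2.15 and 2.18 are not available here). The stack holds (symbol, state) pairs, started with ("#", 0), exactly as in the configurations described above.

```python
RULES = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {lhs for lhs, _ in RULES}
SYMBOLS = {x for _, rhs in RULES for x in rhs}

def first(sym):
    if sym not in NONTERMS:
        return {sym}
    return {t for lhs, rhs in RULES if lhs == sym and rhs[0] != sym
            for t in first(rhs[0])}

def closure(items):
    items, work = set(items), list(items)
    while work:
        r, d, a = work.pop()
        rhs = RULES[r][1]
        if d < len(rhs) and rhs[d] in NONTERMS:
            las = first(rhs[d + 1]) if d + 1 < len(rhs) else {a}
            for j, (lhs, _) in enumerate(RULES):
                if lhs == rhs[d]:
                    for b in las:
                        if (j, 0, b) not in items:
                            items.add((j, 0, b)); work.append((j, 0, b))
    return frozenset(items)

def read(items, X):
    return closure({(r, d + 1, a) for (r, d, a) in items
                    if d < len(RULES[r][1]) and RULES[r][1][d] == X})

def canonical_sets():
    H0 = closure({(0, 0, "$")})
    sets, work = [H0], [H0]
    while work:
        I = work.pop()
        for X in sorted(SYMBOLS):
            J = read(I, X)
            if J and J not in sets:
                sets.append(J); work.append(J)
    return sets

def fill_in_tables(sets):
    action, goto = {}, {}
    for k, I in enumerate(sets):
        for (r, d, a) in I:
            rhs = RULES[r][1]
            if d < len(rhs):
                j = sets.index(read(I, rhs[d]))
                (goto if rhs[d] in NONTERMS else action)[
                    (k, rhs[d])] = j if rhs[d] in NONTERMS else ("shift", j)
            elif r == 0:
                action[(k, a)] = ("accept",)
            else:
                action[(k, a)] = ("reduce", r)
    return action, goto

ACTION, GOTO = fill_in_tables(canonical_sets())

def parse(text):
    # The LR(1) driver: shift pushes a (symbol, state) pair, reduce pops
    # |alpha| pairs and consults the goto table; missing entries mean error.
    stack = [("#", 0)]
    symbols = list(text) + ["$"]
    pos = 0
    while True:
        state, a = stack[-1][1], symbols[pos]
        act = ACTION.get((state, a))
        if act is None:
            return "ERROR"
        if act[0] == "shift":
            stack.append((a, act[1])); pos += 1
        elif act[0] == "reduce":
            lhs, rhs = RULES[act[1]]
            del stack[len(stack) - len(rhs):]
            stack.append((lhs, GOTO[(stack[-1][1], lhs)]))
        else:
            return "O.K."

# parse("dd") and parse("ccdd") are accepted, parse("d") is rejected
```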

Example 2.18 The action and goto tables of the LR(1) parser for the grammar of Example 2.15 are as follows. The empty positions denote errors.

Example 2.19 Using the tables of the previous example, analyse the text .

The syntax tree of the sentence is in Figure 2.11.

#### 2.3.2.4. LALR(1) parser.

Our goal is to decrease the number of states of the parser, since not only the size but also the speed of the compiler depends on the number of states. At the same time, we do not want to radically restrict the class of grammars and languages covered by the new method.

There are many LR(1)-items in the canonical sets that are very similar: their cores are the same, and only their lookahead symbols differ. If there are two or more canonical sets which differ in such similar items only, then we merge these sets.

If two canonical sets are mergeable, then let their union be their merged set.

Execute all possible mergings of canonical sets. After renumbering the indexes we obtain the merged canonical sets, also called LALR(1) canonical sets.

We create the LALR(1) parser from these merged canonical sets.

Example 2.20 Using the canonical sets of Example 2.17, we can merge the following canonical sets:

and ,

and ,

and .

In Figure 2.15 it can be seen that the mergeable sets are in equivalent or similar positions in the automaton.
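The merging step itself is simple: group the canonical sets by the set of cores of their items and take unions. The sketch below demonstrates this on the assumed toy grammar S′ → S, S → CC, C → cC | d, whose ten LR(1) canonical sets merge into seven LALR(1) sets.

```python
RULES = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {lhs for lhs, _ in RULES}
SYMBOLS = {x for _, rhs in RULES for x in rhs}

def first(sym):
    if sym not in NONTERMS:
        return {sym}
    return {t for lhs, rhs in RULES if lhs == sym and rhs[0] != sym
            for t in first(rhs[0])}

def closure(items):
    items, work = set(items), list(items)
    while work:
        r, d, a = work.pop()
        rhs = RULES[r][1]
        if d < len(rhs) and rhs[d] in NONTERMS:
            las = first(rhs[d + 1]) if d + 1 < len(rhs) else {a}
            for j, (lhs, _) in enumerate(RULES):
                if lhs == rhs[d]:
                    for b in las:
                        if (j, 0, b) not in items:
                            items.add((j, 0, b)); work.append((j, 0, b))
    return frozenset(items)

def read(items, X):
    return closure({(r, d + 1, a) for (r, d, a) in items
                    if d < len(RULES[r][1]) and RULES[r][1][d] == X})

def canonical_sets():
    H0 = closure({(0, 0, "$")})
    sets, work = [H0], [H0]
    while work:
        I = work.pop()
        for X in sorted(SYMBOLS):
            J = read(I, X)
            if J and J not in sets:
                sets.append(J); work.append(J)
    return sets

def core(I):
    # the core of a set: its items without the lookahead symbols
    return frozenset((r, d) for (r, d, _) in I)

def merge_canonical_sets(sets):
    # Merge the LR(1) canonical sets whose cores coincide.
    merged = {}
    for I in sets:
        merged.setdefault(core(I), set()).update(I)
    return [frozenset(s) for s in merged.values()]

CANON = canonical_sets()
MERGED = merge_canonical_sets(CANON)
```

The merged set whose core is the single item C → d. collects the lookaheads of both original sets, giving {c, d, $}.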

There is no difficulty with the function read if we use merged canonical sets. If

and

then

We can prove this in the following way. By the definition of the function read, the set read(H, X) depends only on the cores of the LR(1)-items of H, and it is independent of the lookahead symbols. Since the cores of the LR(1)-items of the merged sets are the same, the cores of the LR(1)-items of their read-sets

are also the same. It follows that these read-sets are mergeable into one set.

However, after merging canonical sets of LR(1)-items, the elements of the merged set can raise difficulties. Suppose that we merge two canonical sets.

• After merging there are no shift-shift conflicts. If

and

then there is a shift for the same symbol in both sets, and we saw that the function read does not cause problems, i.e. the read-set of the merged set is the merged set of the read-sets.

• If there is an item

in one canonical set and there is an item

in the other set, then the merged set seems to be an inadequate set for the common symbol, i.e. there seems to be a shift-reduce conflict in the merged set.

But this case never happens. The cores of the items of mergeable sets are the same; the sets differ in lookahead symbols only. It follows that the set containing the first item also contains an item with the core of the second item. Using Theorem 2.24 we get that the grammar is then not an LR(1) grammar, since we get the shift-reduce conflict for the LR(1) parser, too.

• However, after merging, reduce-reduce conflicts may arise. The LR(1) property of the grammar does not exclude this case. In the next example we show such a case.

Example 2.21 Let G be a grammar with the following derivation rules.

This grammar is an LR(1) grammar. For the first viable prefix the LR(1)-items

and for the second viable prefix the LR(1)-items

create two canonical sets.

After merging these two sets we get a reduce-reduce conflict: for certain input symbols the handle is the same, but we cannot decide which of the two rules has to be used for the reduction.

Now we give the method for creating the LALR(1) parsing tables. First we create the canonical sets of LR(1)-items

then we merge the canonical sets in which the sets constructed from the cores of the items are identical. Let

be the merged canonical sets.

For determining the size of the action and goto tables and for filling in these tables we use the merged canonical sets. The method is the same as for the LR(1) parser. The constructed tables are called the LALR(1) parsing tables.

Definition 2.25 If the filling in of the LALR(1) parsing tables does not produce conflicts, then the grammar is said to be an LALR(1) grammar.

The run of the LALR(1) parser is the same as that of the LR(1) parser.

Example 2.22 Denote the result of merging two canonical sets by one merged set, and let the corresponding state be the state belonging to this set.

The canonical sets of the grammar of Example 2.15 were given in Example 2.17, and the mergeable sets were seen in Example 2.20. For this grammar we can create the following LALR(1) parsing tables.

The filling in of the tables is conflict-free, therefore the grammar is an LALR(1) grammar. The automaton of this parser is in Figure 2.18.

Example 2.23 Analyse the text using the LALR(1) parsing tables of the previous example.

The syntax tree of the parsed text is in Figure 2.17.

As can be seen from the previous example, LALR(1) grammars are LR(1) grammars. The converse assertion is not true: in Example 2.21 there is a grammar which is LR(1), but it is not an LALR(1) grammar.

Programming languages can be described by LALR(1) grammars. The most frequently used parsing method in compilers of programming languages is the LALR(1) method. The advantage of the LALR(1) parser is that the sizes of its parsing tables are smaller than the sizes of the LR(1) parsing tables.

For example, the LALR(1) parsing tables for the Pascal language have a few hundred lines, whilst the LR(1) parsers for this language have a few thousand lines.

Exercises

2.3-1 Find the grammars among the following grammars (we give their derivation rules only).

1.

2.

3.

4.

2.3-2 Prove that the next grammars are grammars (we give their derivation rules only).

1.

2.

3.

2.3-3 Prove that the next grammars are not grammars (we give their derivation rules only).

1.

2.

3.

2.3-4 Show that a language has only one sentence.

2.3-5 Prove that the next grammars are grammars (we give their derivation rules only).

1.

2.

2.3-6 Prove that the next grammars are grammars. (we give their derivation rules only).

1.

2.

2.3-7 Prove that the next grammars are not grammars for any (we give their derivation rules only).

1.

2.

2.3-8 Prove that the next grammars are but are not grammars (we give their derivation rules only).

1.

2.

2.3-9 Create parsing table for the above grammars.

2.3-10 Using the recursive descent method, write the parsing program for the above grammars.

2.3-11 Create canonical sets and the parsing tables for the above grammars.

2.3-12 Create merged canonical sets and the parsing tables for the above grammars.

 PROBLEMS

2-1 Lexical analysis of a program text

The algorithm LEX-ANALYSE in Section 2.2 gives a scanner for a text that is described by only one regular expression or deterministic finite automaton, i.e. this scanner is able to analyse only one symbol. Create an automaton which executes the total lexical analysis of a programming language, and give the algorithm LEX-ANALYSE-LANGUAGE for this automaton. Let the input of the algorithm be the text of a program, and the output be the series of symbols. It is obvious that if the automaton goes into a final state then its new work begins at the initial state, for analysing the next symbol. The algorithm finishes its work if it is at the end of the text or a lexical error is detected.

2-2 Series of symbols augmented with data of symbols

Modify the algorithm of the previous task in such a way that the output is the series of symbols augmented with their appropriate attributes. For example, the attribute of a variable is the character string of its name, and the attribute of a number is its value and type. It is practical to write pointers to the symbols in the places of the data.

2-3 LALR(1) parser from LR(0) canonical sets

If we omit the lookahead symbols from the LR(1)-items, then we get LR(0)-items. We can define the functions closure and read for LR(0)-items too, not taking care of the lookahead symbols. Using a method similar to the LR(1) method, we can construct the LR(0) canonical sets

One can observe that the number of merged canonical sets is equal to the number of LR(0) canonical sets, since the cores of the LR(1)-items of the merged canonical sets are the same as the items of the LR(0) canonical sets. Therefore the number of states of the LALR(1) parser is equal to the number of states of its LR(0) parser.

Using this property, we can construct the LALR(1) canonical sets from the LR(0) canonical sets, by completing the items of the LR(0) canonical sets with lookahead symbols. The result of this procedure is the set of LALR(1) canonical sets.

It is obvious that the right part of an LR(0)-item begins with the sign "point" only if this item was constructed by the function closure. (We notice that there is one exception, the item [S′ → .S] of the first canonical set.) Therefore there is no need to store all items of the canonical sets. Let the kernel of the first canonical set be the LR(0)-item [S′ → .S], and let the kernel of any other canonical set be the set of its LR(0)-items in which there is no "point" at the first position of the right side. We can give an LR(0) canonical set by its kernel, since all of its items can be constructed from the kernel using the function closure.

If we complete the items of the kernels of the LR(0) canonical sets with lookahead symbols, then we get the kernels of the merged LALR(1) canonical sets. That is, if the kernel of an LR(0) canonical set is given, then from it with completions we get the kernel of the corresponding LALR(1) canonical set.

If we know the LR(0) canonical sets, then the LALR(1) canonical sets can be constructed easily. For the LR(1)-items, however, we also have to determine the lookahead symbols.

If the lookahead symbol obtained in this way differs from the dummy symbol, then it is sure that it belongs to the item. In this case we say that the lookahead symbol was spontaneously generated for this item of the canonical set. The dummy symbol does not play an important role in the construction of such a lookahead symbol.

If then is an element of the set , and the lookahead symbol is . In this case we say that the lookahead symbol is propagated from into the item of the set .

If the kernels of the LR(0) canonical sets are given, then we construct the propagated and spontaneously generated lookahead symbols for the items of the canonical sets by the following method.

For all items [B → γ.δ] of the kernel we construct the set closure({[B → γ.δ, #]}), where # is a dummy symbol,

• if and then and the symbol is spontaneously generated into the item of the set ,

• if then , and the symbol is propagated from into the item of the set .

The kernel of the first canonical set has only one element, the item [S′ → .S]. For this item we can give the lookahead symbol directly. Since the cores of the kernels of all canonical sets are given, using the above method we can calculate all of the propagated and spontaneously generated symbols.

Give the algorithm which constructs the LALR(1) canonical sets from the LR(0) canonical sets using the methods of propagation and spontaneous generation.
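As a worked illustration of the propagation and spontaneous-generation method (not a replacement for the requested algorithm), the following sketch computes the lookaheads of the LR(0) kernel items for an assumed toy grammar S′ → S, S → CC, C → cC | d, with "#" as the dummy symbol and "$" as the end marker.

```python
# LR(0) items are (rule, dot) pairs; LR(1) items add a lookahead symbol.
RULES = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMS = {lhs for lhs, _ in RULES}
SYMBOLS = {x for _, rhs in RULES for x in rhs}

def first(sym):
    if sym not in NONTERMS:
        return {sym}
    return {t for lhs, rhs in RULES if lhs == sym and rhs[0] != sym
            for t in first(rhs[0])}

def closure0(items):                  # LR(0) closure, no lookaheads
    items, work = set(items), list(items)
    while work:
        r, d = work.pop()
        rhs = RULES[r][1]
        if d < len(rhs) and rhs[d] in NONTERMS:
            for j, (lhs, _) in enumerate(RULES):
                if lhs == rhs[d] and (j, 0) not in items:
                    items.add((j, 0)); work.append((j, 0))
    return frozenset(items)

def goto0(I, X):
    return closure0({(r, d + 1) for (r, d) in I
                     if d < len(RULES[r][1]) and RULES[r][1][d] == X})

def closure1(items):                  # LR(1) closure, used with the dummy "#"
    items, work = set(items), list(items)
    while work:
        r, d, a = work.pop()
        rhs = RULES[r][1]
        if d < len(rhs) and rhs[d] in NONTERMS:
            las = first(rhs[d + 1]) if d + 1 < len(rhs) else {a}
            for j, (lhs, _) in enumerate(RULES):
                if lhs == rhs[d]:
                    for b in las:
                        if (j, 0, b) not in items:
                            items.add((j, 0, b)); work.append((j, 0, b))
    return frozenset(items)

# the LR(0) canonical collection
states, work = [closure0({(0, 0)})], [closure0({(0, 0)})]
while work:
    I = work.pop()
    for X in sorted(SYMBOLS):
        J = goto0(I, X)
        if J and J not in states:
            states.append(J); work.append(J)

def kernel(I):
    return frozenset((r, d) for (r, d) in I if d > 0 or r == 0)

# la[i][item]: lookahead set of a kernel item of state i; "$" is given
# directly to the start item, the rest is spontaneous or propagated.
la = [{k: set() for k in kernel(I)} for I in states]
la[0][(0, 0)].add("$")
prop = []
for i, I in enumerate(states):
    for (r0, d0) in kernel(I):
        for (r, d, a) in closure1({(r0, d0, "#")}):
            rhs = RULES[r][1]
            if d < len(rhs):
                j = states.index(goto0(I, rhs[d]))
                if a == "#":                       # propagated lookahead
                    prop.append((i, (r0, d0), j, (r, d + 1)))
                else:                              # spontaneously generated
                    la[j][(r, d + 1)].add(a)
changed = True
while changed:                                     # propagate to a fixed point
    changed = False
    for (i, k, j, t) in prop:
        if not la[i][k] <= la[j][t]:
            la[j][t] |= la[i][k]; changed = True
```

For this grammar there are seven LR(0) states, and the kernel item C → d. collects the lookaheads c and d spontaneously and $ by propagation, giving the same lookahead set as the merging of the LR(1) canonical sets.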

 CHAPTER NOTES

The theory and practice of compilers, computers and programming languages are of the same age. The construction of the first compilers dates back to the 1950's. The task of writing compilers was a very hard task at that time; the first Fortran compiler took 18 man-years to implement []. Since that time more and more precise definitions and solutions have been given to the problems of compilation, and better and better methods and utilities have been used in the construction of translators.

The development of formal languages and automata was a great leap forward, and we can say that this development was urged by the demand for writing compilers. Nowadays this task is a simple routine project. New results, new discoveries are expected in the field of code optimisation only.

One of the earliest nondeterministic and backtracking algorithms appeared in the 1960's. The first two dynamic programming algorithms were the CYK (Cocke–Younger–Kasami) algorithm from 1965–67 and the Earley algorithm from 1965. The idea of precedence parsers is from the end of the 1970's and the beginning of the 1980's. LR(k) grammars were defined by Knuth in 1965; the definition of LL(k) grammars is dated from the beginning of the 1970's. LALR(1) grammars were studied by DeRemer in 1971; the elaboration of LALR(1) parsing methods was finished at the beginning of the 1980's [] [] [].

By the middle of the 1980's it became obvious that the LALR(1) parsing methods are the really efficient methods, and since then the LALR(1) methods have been used in compilers [].

A lot of excellent books deal with the theory and practice of compilers. Perhaps the most successful of them was the book of Gries []; in this book there are interesting results for precedence grammars. The first successful book which wrote about the new algorithms was that of Aho and Ullman []; we can also find the CYK and the Earley algorithms there. It was followed by the "dragon book" of Aho and Ullman []; the extended and corrected issue of it was published in 1986 by the authors Aho, Sethi and Ullman [].

Without aiming at completeness we mention the books of Fischer and LeBlanc [], Tremblay and Sorenson [], Waite and Goos [], Hunter [], Pittman [] and Mak []. Advanced achievements are presented in recently published books, among others in the book of Muchnick [], of Grune, Bal, Jacobs and Langendoen [], in the book of Cooper and Torczon [], and in a chapter of the book by Louden [].