2.2 Lexical analysis

The source-handler transforms the source program into a character stream. The main task of the lexical analyser (scanner) is recognising the symbolic units in this character stream. These symbolic units are named symbols.

Unfortunately, in different programming languages the same symbolic units consist of different character streams, and different symbolic units consist of the same character streams. For example, there is a programming language in which the 1. and .10 character streams denote real numbers. If we concatenate these symbols, then the result is the 1..10 character stream. The fact that the sign of an algebraic operation is missing between the two numbers will be detected by the next analyser, which does the syntactic analysis. However, there are programming languages in which this character stream is decomposed into three components: 1 and 10 are the lower and upper limits of an interval type variable, separated by the .. symbol.

The lexical analyser determines not only the characters of a symbol, but also the attributes derived from the surrounding text. Such attributes are, e.g., the type and value of a symbol.

The scanner assigns codes to the symbols; the same code is assigned to the same sort of symbols. For example, the code of all integer numbers is the same; another unique code is assigned to variables.

The lexical analyser transforms the character stream into a series of symbol codes, and the attributes of a symbol are written in this series immediately after the code of the symbol concerned.
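As an illustration, a minimal sketch of such an output series in Python, with entirely hypothetical symbol codes and attribute values (a real scanner defines its own numbering):

# Hypothetical symbol codes; a real compiler chooses its own numbering.
IF, IDENTIFIER, RELATION, CONSTANT = 1, 2, 3, 4

# Series produced for the input "if a > 5": each code is followed
# immediately by the attributes of the symbol concerned.
series = [IF,
          IDENTIFIER, 40,   # 40: assumed symbol-table address of "a"
          RELATION, ">",
          CONSTANT, 5]      # 5: the value attribute of the constant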

The output information of the lexical analyser is not “readable”: it is usually a series of binary codes. We note that, from the viewpoint of the compiler, from this step of the compilation on it does not matter from which characters the symbol was made, i.e., whether the code of the if symbol was made from the English if, the Hungarian ha or the German wenn characters. Therefore, for a programming language using English keywords, it is easy to construct another programming language using the keywords of another language. In the compiler of this new programming language only the lexical analysis has to be modified; the other parts of the compiler remain unchanged.

2.2.1 The automaton of the scanner

The exact definition of symbolic units can be given by regular grammars, regular expressions or deterministic finite automata. The theory of regular grammars, regular expressions and deterministic finite automata was studied in previous chapters.

In principle the lexical analyser could be a part of the syntactic analyser. The main reason to distinguish these analysers is that a lexical analyser made from a regular grammar is much simpler than a lexical analyser made from a context-free grammar. Context-free grammars are used to create syntactic analysers.

One of the most popular methods to create the lexical analyser is the following:

  1. describe symbolic units in the language of regular expressions, and from this information construct the deterministic finite automaton which is equivalent to these regular expressions,

  2. implement this deterministic finite automaton.

We note that regular expressions are used for writing down the symbols because they are more comfortable and readable than regular grammars. There are standard programs, such as the lex of UNIX systems, that generate a complete lexical analyser from regular expressions. Moreover, there are generator programs that give the automaton of the scanner, too.

A very trivial implementation of the deterministic finite automaton uses multidirectional CASE instructions. The conditions of the branches are the characters of the state transitions, and the instructions of a branch represent the new state the automaton reaches when it carries out the given state transition.

The main principle of the lexical analyser is building a symbol from the longest possible series of characters. For example, the string ABC is one three-letter symbol, rather than three one-letter symbols. This means that the alternative instructions of the CASE branches read characters as long as they can be parts of the symbol under construction.
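A minimal sketch of this technique in Python, using the match statement as the multidirectional CASE instruction; the automaton recognises identifiers (a letter or underscore followed by letters, digits or underscores), and the state names are assumptions made for illustration:

def scan_identifier(text, pos):
    """Read the longest identifier starting at text[pos]; return (symbol, new_pos)."""
    state = "START"
    start = pos
    while pos < len(text):
        ch = text[pos]
        match state:
            case "START":
                if ch.isalpha() or ch == "_":
                    state = "IN_ID"    # first character accepted
                    pos += 1
                else:
                    break              # no identifier starts here
            case "IN_ID":
                if ch.isalnum() or ch == "_":
                    pos += 1           # keep reading while the symbol grows
                else:
                    break              # longest match found
    return (text[start:pos], pos)

For example, scan_identifier("ABC+1", 0) returns ("ABC", 3): the three-letter symbol, not three one-letter symbols.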

Functions can belong to the final states of the automaton. For example, such a function converts constant symbols into the internal binary form of constants, or writes identifiers into the symbol table.

The input stream of the lexical analyser contains tabulators and space characters, since the source-handler expunges the carriage return and line feed characters only. In most programming languages it is possible to write a lot of spaces or tabulators between symbols. From the point of view of compilers these characters have no importance after their recognition, hence they are called white spaces.

Expunging white spaces is the task of the lexical analyser. The description of the white space is the following regular expression:

(space | tab)(space | tab)*,

where space and tab are the characters which build the white space symbols and | is the symbol of the or function. No action has to be taken with these white space symbols; the scanner does not pass them to the syntactic analyser.

Some examples for regular expression:

Example 2.1 Introduce the following notations: let D be an arbitrary digit and let L be an arbitrary letter, that is

D ∈ {0, 1, ..., 9} and L ∈ {a, b, ..., z, A, B, ..., Z},

the not-visible characters are denoted by their short names, and let ε be the name of the empty character stream. ¬a denotes a character distinct from a. The regular expressions are:

  1. real number: (+ | - | ε)D+.D+,

  2. positive integer and real number: (D+(ε | .)) | (D*.D+),

  3. identifier: (L | _)(L | D | _)*,

  4. comment: --(¬eol)*eol,

  5. comment terminated by ##: ##((# | ε)¬#)*##,

  6. string of characters: "(¬" | "")*".
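For comparison, a sketch of expressions 1, 2, 3 and 6 in the notation of Python's re module, under the reconstruction above and assuming ASCII letters and digits; the alternatives of expression 2 are ordered so that the longer real form is tried first:

import re

D, L = "[0-9]", "[A-Za-z]"

real_number = re.compile(rf"[+-]?{D}+\.{D}+")      # 1. real number
int_or_real = re.compile(rf"{D}*\.{D}+|{D}+\.?")   # 2. positive integer and real
identifier  = re.compile(rf"({L}|_)({L}|{D}|_)*")  # 3. identifier
string_lit  = re.compile(r'"([^"]|"")*"')          # 6. string with doubled quotes

assert identifier.match("abc123").group() == "abc123"
assert string_lit.match('"This is a ""string"""').group() == '"This is a ""string"""'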

Figure 2.4.  The positive integer and real number.



Deterministic finite automata constructed from regular expressions 2 and 3 are in Figures 2.4 and 2.5.

Figure 2.5.  The identifier.



The task of the lexical analyser is to determine the text of symbols, but not all the characters of a regular expression belong to the symbol. As in the 6th example, the first and the last " characters do not belong to the symbol. To solve this problem, a buffer is created for the scanner. After recognising a symbol, the characters of this symbol are in the buffer. Now the deterministic finite automaton is supplemented by the T transfer function, where T(a) means that the character a is inserted into the buffer.

Figure 2.6.  A comment.


Example 2.2 The 4th and 6th regular expressions of Example 2.1 are supplemented by the T function; automata for these expressions are in Figures 2.6 and 2.7. The automaton of the 4th regular expression has no T function, since it recognises comments. The automaton of the 6th regular expression recognises This is a "string" from the character stream "This is a ""string""".
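A sketch of the 6th automaton in Python, with the T function realised as buffer insertions; the function and variable names are illustrative only:

def scan_string(text, pos):
    """Recognise "..." with doubled-quote escapes; return (buffer, new_pos) or None."""
    if pos >= len(text) or text[pos] != '"':
        return None
    buffer = []                    # the T function inserts characters here
    pos += 1                       # the opening " is not buffered
    while pos < len(text):
        ch = text[pos]
        if ch != '"':
            buffer.append(ch)      # T(ch): ordinary character goes to the buffer
            pos += 1
        elif pos + 1 < len(text) and text[pos + 1] == '"':
            buffer.append('"')     # T("): a doubled "" yields one buffered "
            pos += 2
        else:
            return ("".join(buffer), pos + 1)   # closing " is not buffered
    return None                    # unterminated string

For the input of Example 2.2, scan_string('"This is a ""string"""', 0) yields the buffer content 'This is a "string"'.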

Figure 2.7.  The character string.



Now we write the algorithm of the lexical analyser given by a deterministic finite automaton. (A state which is a set consisting of one element will be denoted by the only element of the set.)

Let A = (Q, Σ, δ, q0, F) be the deterministic finite automaton which is the scanner. We augment the alphabet Σ with a new notion: let others be all the characters not in Σ. Accordingly, we modify the transition function δ:

δ'(q, a) = δ(q, a), if a ≠ others,
δ'(q, a) = ∅ otherwise.

The algorithm of parsing, using the augmented automaton A' = (Q, Σ ∪ {others}, δ', q0, F), follows:

LEX-ANALYSE(x#, A')

 1  q ← q0, a ← first character of x#
 2  s' ← analyzing
 3  WHILE a ≠ # AND s' = analyzing
 4     DO IF δ'(q, a) ≠ ∅
 5           THEN q ← δ'(q, a)
 6                a ← next character of x#
 7           ELSE s' ← error
 8  IF s' = analyzing AND q ∈ F
 9     THEN s' ← O.K.
10     ELSE s' ← ERROR
11  RETURN s', a

The algorithm has two parameters: the first one is the input character string terminated by #, the second one is the automaton of the scanner. In line 1 the state of the scanner is set to q0, the start state of the automaton, and the first character of the input string is determined. The variable s' indicates that the algorithm is analysing the input string; the text analyzing is set in this variable in line 2. In line 5 a state transition is executed. It can be seen that the above augmentation is needed for termination in the case of an unexpected, invalid character. In lines 8–10 O.K. means that the analysed character string is correct, and ERROR signs that a lexical error was detected. In the case of successful termination the variable a contains the # character; at erroneous termination it contains the invalid character.
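A direct transcription of LEX-ANALYSE into Python, a sketch in which the automaton is given by a transition function delta(q, a) that returns None in place of the empty set:

def lex_analyse(x, delta, q0, final_states):
    """Analyse one symbol of x (terminated by '#'); return (sign, last character)."""
    q, i = q0, 0
    a = x[i]                          # line 1: first character of x
    status = "analyzing"              # line 2
    while a != "#" and status == "analyzing":     # line 3
        if delta(q, a) is not None:   # line 4: delta'(q, a) is not the empty set
            q = delta(q, a)           # line 5: state transition
            i += 1
            a = x[i]                  # line 6: next character of x
        else:
            status = "error"          # line 7: invalid character, stop analysing
    if status == "analyzing" and q in final_states:   # line 8
        return ("O.K.", a)            # line 9
    return ("ERROR", a)               # lines 10-11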

We note that the algorithm LEX-ANALYSE recognises one symbol only, and then it terminates. A program written in a programming language consists of a lot of symbols, hence after recognising a symbol, the algorithm has to be continued by detecting the next symbol. The work of the analyser is restarted at the start state of the automaton. We propose the full algorithm of the lexical analyser as an exercise (see Problem 2-1).

Example 2.3 The automaton of the identifier of point 3 of Example 2.1 is in Figure 2.5. The start state is 0, and the final state is 1. The transition function of the automaton is the following:

δ    L    D    _
0    1    ∅    1
1    1    1    1

The augmented transition function of the automaton:

δ'   L    D    _    others
0    1    ∅    1    ∅
1    1    1    1    ∅

The algorithm LEX-ANALYSE gives the series of states 0111111 and the sign O.K. to the input string abc123#, it gives the sign ERROR to the input string 9abc#, and the series 0111 and the sign ERROR to the input string abc 123.
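The same example run with the sketch of LEX-ANALYSE above; the function below encodes the augmented transition table, with None standing for ∅, and the inputs are terminated by # as the algorithm requires:

def delta_identifier(q, a):
    """Augmented transition function of the identifier automaton (Figure 2.5)."""
    if q == 0 and (a.isalpha() or a == "_"):
        return 1
    if q == 1 and (a.isalnum() or a == "_"):
        return 1
    return None                       # every other character belongs to 'others'

print(lex_analyse("abc123#", delta_identifier, 0, {1}))   # ('O.K.', '#')
print(lex_analyse("9abc#", delta_identifier, 0, {1}))     # ('ERROR', '9')
print(lex_analyse("abc 123#", delta_identifier, 0, {1}))  # ('ERROR', ' ')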

2.2.2 Special problems

In this subsection we investigate the problems which emerge during the running of the lexical analyser, and give solutions for these problems.

2.2.2.1.  Keywords, standard words.

All programming languages allow identifiers which have special names and predefined meanings. These are the keywords. Keywords are used only in their original meanings. However, there are identifiers which also have predefined meanings but are alterable in the programs. These words are called standard words.

The number of keywords and standard words in programming languages varies. For example, there is a programming language in which three keywords are used for the zero value: zero, zeros and zeroes.

Now we investigate how the lexical analyser recognises keywords and standard words, and how it distinguishes them from identifiers created by the programmers.

The usage of a standard word distinctly from its original meaning renders extra difficulty, not only to the compilation process but also to the readability of the program, such as in the next example:

if if then else = then;

or if we declare procedures which have the names begin and end:

begin
   begin; begin end; end; begin end;
end;

Recognition of keywords and standard words is a simple task if they are written using special type characters (for example bold characters), or they are between special prefix and postfix characters (for example between apostrophes).

We give two methods to analyse keywords.

  1. All keywords are written as a regular expression, and the implementation of the automaton created for this expression is prepared. The disadvantage of this method is the size of the analyser program. It will be large even if the descriptions of keywords whose first letters are the same are contracted.

  2. Keywords are stored in a special keyword-table. The words can be determined in the character stream by a general identifier-recogniser. Then, by a simple search algorithm, we check whether this word is in the keyword-table. If this word is in the table then it is a keyword; otherwise it is an identifier defined by the user. This method is very simple, but the efficiency of the search depends on the structure of the keyword-table and on the algorithm of the search. A well-selected mapping function and an adequate keyword-table can be very effective, as the sketch after this list shows.
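A sketch of the second method in Python, with an illustrative keyword set; Python's built-in set already provides a well-selected (hash-based) mapping function:

KEYWORDS = {"if", "then", "else", "begin", "end", "while", "do"}  # illustrative set

def classify(word):
    """Decide whether a word found by the identifier-recogniser is a keyword."""
    return "keyword" if word in KEYWORDS else "identifier"

print(classify("begin"))   # keyword
print(classify("begin2"))  # identifier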

If it is possible to write standard words in the programming language, then the lexical analyser recognises the standard words using one of the above methods. But the meaning of a standard word depends on its context. To decide whether it has its original meaning or was redefined by the programmer is the task of the syntactic analyser.

2.2.2.2.  Look ahead.

Since the lexical analyser creates a symbol from the longest character stream, it has to look ahead one or more characters to determine the right end of a symbol. There is a classical example of this problem, the next two FORTRAN statements:

DO 10 I = 1.1000

DO 10 I = 1,1000

In the FORTRAN programming language space characters are not significant, they do not play any part, hence the character between 1 and 1000 decides whether the statement is a DO loop statement or an assignment statement for the DO10I identifier.

To sign the right end of the symbol, we introduce the symbol / into the description of regular expressions. Its name is the lookahead operator. Using this symbol the description of the above DO keyword is the following:

DO / (letter | digit)* = (letter | digit)* ,

This definition means that the lexical analyser determines that the first two letters, D and O, form the DO keyword if, looking ahead after the O letter, there are letters or digits, then there is an equal sign, after this sign there are letters or digits again, and finally there is a "," character. The lookahead operator means that the lexical analyser has to look ahead after the DO characters. We remark that, using this lookahead method, the lexical analyser detects the DO keyword even if there is an error in the character stream, such as in the DO2A=3B, character stream, but in a correct assignment statement it does not detect the DO keyword.
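The same lookahead can be sketched with a lookahead assertion in Python's re notation, where (?=...) plays the role of the / operator: the asserted context is checked but is not part of the recognised symbol. The space removal imitates FORTRAN's insignificant spaces:

import re

# DO / (letter | digit)* = (letter | digit)* ,
do_keyword = re.compile(r"DO(?=[A-Za-z0-9]*=[A-Za-z0-9]*,)")

print(do_keyword.match("DO 10 I = 1,1000".replace(" ", "")))  # matches: DO loop
print(do_keyword.match("DO 10 I = 1.1000".replace(" ", "")))  # None: assignment
print(do_keyword.match("DO2A=3B,"))                           # matches, despite the error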

In the next example we deal with positive integers. The definition of integer numbers is a prefix of the definition of real numbers, and the definition of real numbers is a prefix of the definition of real numbers containing an explicit power-part:

integer number: D+,
real number: D+.D+,
real number with power-part: D+.D+e(+ | - | ε)D+.

The automaton for all of these three expressions is the automaton of the longest character stream, the real number containing an explicit power-part.

The problem of the lookahead symbols is resolved using the following algorithm. Put the read characters into a buffer, and beside each character store a piece of auxiliary information. This information is "invalid" if the character string read so far is not a correct symbol; otherwise it is the type of the symbol. If the automaton is in a final state, then it has recognised a real number with an explicit power-part. If the automaton is in an internal state and there is no possibility to read a next character, then the longest character stream which has valid information is the recognised symbol.

Example 2.4 Consider the 12.3e+f# character stream, where the character # is the end sign of the analysed text. If in this character stream there was a positive integer number in the place of the character f, then this character stream would be a real number. The content of the buffer of the lexical analyser:

1       integer number
12      integer number
12.     invalid
12.3    real number
12.3e   invalid
12.3e+  invalid

The recognised symbol is the real number 12.3. The lexical analysis is continued at the text e+f.
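A sketch of this buffer algorithm in Python for the three number symbols above; the patterns and names are illustrative, and the validity information is stored beside each prefix exactly as in the buffer of Example 2.4:

import re

# Illustrative patterns: the integer is a prefix of the real number, the real
# number is a prefix of the real number with explicit power-part.
PATTERNS = [("integer number", re.compile(r"[0-9]+")),
            ("real number", re.compile(r"[0-9]+\.[0-9]+")),
            ("real number with power-part",
             re.compile(r"[0-9]+\.[0-9]+e[+-]?[0-9]+"))]

def scan_number(text):
    """Return the longest prefix of text carrying valid information, and its type."""
    buffer = []                                   # (prefix, auxiliary information)
    for i in range(1, len(text) + 1):
        if text[i - 1] == "#":                    # end sign of the analysed text
            break
        prefix = text[:i]
        info = "invalid"
        for name, pattern in PATTERNS:
            if pattern.fullmatch(prefix):
                info = name                       # prefix is a complete symbol
        buffer.append((prefix, info))
    for prefix, info in reversed(buffer):         # longest valid prefix wins
        if info != "invalid":
            return (prefix, info)
    return (None, "ERROR")

print(scan_number("12.3e+f#"))   # ('12.3', 'real number'); analysis continues at e+f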

The number of lookahead characters may be determined from the definition of the programming language. In modern languages this number is at most two.

2.2.2.3.  The symbol table.

There are programming languages, for example C, in which small letters and capital letters are different. In this case the lexical analyser uses the characters of all symbols without modification. Otherwise the lexical analyser converts all characters to their small letter form or to their capital letter form. It is proposed to execute this transformation in the source handler program.

In the case of simpler programming languages the lexical analyser writes the characters of the detected symbol into the symbol table, if this symbol is not there yet. After writing it up, or if the symbol was already in the symbol table, the lexical analyser returns the table address of this symbol, and writes this information into its output. These data will be important at semantic analysis and code generation.
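A sketch of this insert-or-look-up operation on the symbol table; here the returned "address" is simply an index into the table, which is an assumption made for illustration:

class SymbolTable:
    def __init__(self):
        self.addresses = {}       # symbol text -> table address
        self.symbols = []         # the table itself

    def lookup_or_insert(self, text):
        """Return the address of text, inserting it first if it is new."""
        if text not in self.addresses:
            self.addresses[text] = len(self.symbols)
            self.symbols.append(text)
        return self.addresses[text]

table = SymbolTable()
print(table.lookup_or_insert("alpha"))  # 0 (newly inserted)
print(table.lookup_or_insert("beta"))   # 1
print(table.lookup_or_insert("alpha"))  # 0 (already present)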

2.2.2.4.  Directives.

In programming languages the directives serve to control the compiler. The lexical analyser identifies directives and recognises their operands, and usually there are further tasks with these directives.

If the directive is the if of the conditional compilation, then the lexical analyser has to detect all parameters of this condition, and it has to evaluate the value of the condition. If this value is false, then it has to omit the next lines until the else or endif directive. This means that the lexical analyser performs syntactic and semantic checking, and creates code-style information. This task is more complicated if the programming language allows writing nested conditions.
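A sketch of skipping the lines of a false branch, with a counter for nested conditions; the directive spellings (#if, #else, #endif) are assumptions for illustration:

def skip_false_branch(lines, i):
    """lines[i] is an 'if' directive whose condition is false; return the
    index of the matching 'else' or 'endif' line."""
    depth = 0                                 # nesting level of inner conditions
    i += 1
    while i < len(lines):
        word = lines[i].strip()
        if word.startswith("#if"):
            depth += 1                        # entering a nested condition
        elif word.startswith("#endif"):
            if depth == 0:
                return i                      # matching endif found
            depth -= 1
        elif word.startswith("#else") and depth == 0:
            return i                          # matching else found
        i += 1
    return i                                  # unterminated conditional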

Other types of directives are the substitution of macros and the inclusion of files into the source text. These tasks are far from the original task of the lexical analyser.

The usual way to solve these problems is the following: the compiler executes a pre-processing program, and this program performs all of the tasks specified by the directives.

Exercises

2.2-1 Give a regular expression for the comments of a programming language. In this language the delimiters of comments are /* and */, and inside a comment the / and * characters may occur, but */ is forbidden.

2.2-2 Modify the result of the previous question if it is supposed that the programming language allows nested comments.

2.2-3 Give a regular expression for positive integer numbers, if the pre- and post-zero characters are prohibited. Give a deterministic finite automaton for this regular expression.

2.2-4 Write a program which re-creates the original source program from the output of the lexical analyser. Pay attention to the nice and correct positions of the re-created character streams.