In formal language theory, parsing (also called syntactic analysis) is the process by which a parser determines whether a given string can be generated by a given grammar. This is very important for compilers and interpreters. For example, it is not too difficult to create a context-free grammar G_P generating all syntactically correct Pascal programs. Then we can use a parser to decide whether a Pascal program written by a programmer belongs to the generated language L(G_P); if it does, the program is syntactically correct.
We are given a Chomsky normal form grammar G = (N, T, S, P) and a word p = a_1a_2...a_n. The Cocke-Younger-Kasami (CYK) algorithm is a well known method to decide whether p ∈ L(G) or p ∉ L(G). To make the decision, we fill out an n × n triangular matrix M in the following way. Over the cells of the first line, we write the letters a_1, a_2, ..., a_n, starting from the first letter, one after the other. The cell M(i,j) contains the nonterminal A if and only if the subword a_j a_{j+1} ... a_{j+i-1} can be derived from A. (Formally: A ∈ M(i,j) if and only if A →* a_j a_{j+1} ... a_{j+i-1}.)

This means that the first cell of the first line contains the nonterminal A if and only if A → a_1 ∈ P, and in general the cell M(1,j) contains the nonterminal A if and only if A → a_j ∈ P. It is also quite easy to fill out the cells of the second line of the matrix: the nonterminal A is in the cell M(2,j) if and only if there exist nonterminals B, C ∈ N such that B ∈ M(1,j), C ∈ M(1,j+1), and A → BC ∈ P. From the third line on, the algorithm becomes more complex, and we use the following formula: A ∈ M(i,j) if and only if there exist nonterminals B, C ∈ N and an integer 1 ≤ k ≤ i-1 such that B ∈ M(k,j), C ∈ M(i-k, j+k), and A → BC ∈ P.

The algorithm is finished when the cell M(n,1) is filled out. Remember, the nonterminal A is in the cell M(i,j) if and only if the word a_j a_{j+1} ... a_{j+i-1} can be derived from A. This means that the nonterminal S is in the cell M(n,1) if and only if the word a_1a_2...a_n can be derived from S. So the grammar G generates the word p if and only if the cell M(n,1) contains the start symbol S.
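The description above translates almost directly into code. Below is a minimal Python sketch of such a CYK recognizer; the function name cyk and the encoding of the rules as (A, a) pairs and (A, B, C) triples are our own illustrative choices, not notation from the text.

def cyk(word, unit_rules, pair_rules, start):
    # unit_rules: set of pairs (A, a) for rules A -> a
    # pair_rules: set of triples (A, B, C) for rules A -> BC
    n = len(word)
    # M[i][j] holds the set of nonterminals deriving the subword of
    # length i that starts at position j (both 1-based, as in the text).
    M = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # First line: A is in M(1,j) if and only if A -> a_j is a rule.
    for j in range(1, n + 1):
        for (A, a) in unit_rules:
            if a == word[j - 1]:
                M[1][j].add(A)
    # Lines 2..n: A is in M(i,j) if and only if A -> BC is a rule with
    # B in M(k,j) and C in M(i-k, j+k) for some 1 <= k <= i-1.
    for i in range(2, n + 1):
        for j in range(1, n - i + 2):
            for k in range(1, i):
                for (A, B, C) in pair_rules:
                    if B in M[k][j] and C in M[i - k][j + k]:
                        M[i][j].add(A)
    # p is in L(G) if and only if the start symbol is in M(n,1).
    return start in M[n][1]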
Example 45. In this example, we use the CYK algorithm to show that the grammar G generates the word abbaa.
G = ({S, A, B}, {a,b}, S, P) P = { S → SA, S → AB, A → BS, B → SA, S → a, A → a, B → b }
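For reference, this is how the grammar of Example 45 can be fed to the cyk sketch given above (the encoding is again our own); the comments list the cells of the filled matrix obtained from the recurrence.

# Grammar of Example 45, encoded for the cyk sketch above.
unit_rules = {("S", "a"), ("A", "a"), ("B", "b")}
pair_rules = {("S", "S", "A"), ("S", "A", "B"),
              ("A", "B", "S"), ("B", "S", "A")}

# Filled cells for p = abbaa (computed with the formula above):
#   line 1: M(1,1)={S,A}  M(1,2)={B}  M(1,3)={B}  M(1,4)={S,A}  M(1,5)={S,A}
#   line 2: M(2,1)={S}    M(2,2)={}   M(2,3)={A}  M(2,4)={S,B}
#   line 3: M(3,1)={}     M(3,2)={}   M(3,3)={A}
#   line 4: M(4,1)={S,B}  M(4,2)={}
#   line 5: M(5,1)={S,A,B}
print(cyk("abbaa", unit_rules, pair_rules, "S"))   # prints True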
As you can see, S ∈ M(5,1), so abbaa ∈ L(G).
Exercise 60. Use the CYK algorithm to show that the grammar G generates the word baabba.
G = ({S, A, B, X, Y, Z}, {a,b}, S, P) P = { S → AY, Y → XB, X → BA, X → ZA, Z → BX, A → b, B → a }
Exercise 61. Use the CYK algorithm to decide if the word cbacab can be generated by grammar G.
G = ({S, A, B, C, D}, {a,b,c}, S, P) P = { S → AB, A → CA, A → SS, B → CD, A → b, C → a, C → b, D → c }
The Earley algorithm is designed to decide whether a context-free grammar generates a given terminal word. It is sometimes inconvenient to create and use an equivalent Chomsky normal form grammar for a λ-free context-free grammar, because the Chomsky normal form grammar may have many more production rules than the original grammar. This is why the Earley algorithm is more widely used than the CYK algorithm in computerized syntactic analysis. Although the Earley algorithm looks more complicated to humans (and indeed it is more complicated than the very simple CYK algorithm), after implementation there is no real difference between the complexity of the two algorithms for computers.
Now we are going to show the steps of the λ-free version of the Earley algorithm. With a minor modification it can also handle rules of the form A → λ, but in practice we do not need the extended version.
Let G = (N, T, S, P) be a λ-free context-free grammar, and let p = a_1a_2...a_n ∈ T+ with n > 0. We are going to fill out the cells of an (n+1) × (n+1) triangular matrix M, except for the last cell M(n,n). Over the cells of the first line of the matrix, we write the letters a_1, a_2, ..., a_n, starting from the second cell and the first letter, one after the other. The elements of the matrix are production rules from P, in which the right-hand side of each rule contains a dot character.
The steps of the algorithm are the following:
1. Let S → .q ∈ M(0,0) if S → q ∈ P, and let j = 0.
2. Let A → .q ∈ M(j,j) if A → q ∈ P and there exists an integer k ≤ j such that B → r.At ∈ M(k,j).
3. Let j = j+1 and let i = j-1.
4. Let A → ra_j.t ∈ M(i,j) if A → r.a_jt ∈ M(i,j-1).
5. Let A → rB.t ∈ M(i,j) if there exists an integer i ≤ k < j such that A → r.Bt ∈ M(i,k) and B → q. ∈ M(k,j).
6. If i > 0, then let i = i-1 and go to step 4.
7. If i = 0 and j < n, then go to step 2.
8. If i = 0 and j = n, then the algorithm is finished.
Here q ∈ (T ∪ N)+, A, B ∈ N, r, t ∈ (T ∪ N)*, and of course i,j,k are integers.
Grammar G generates the word p (that is, p ∈ L(G)) if and only if there is a production rule in M(0,n) whose left-hand side is the start symbol S and which has a dot at the end of its right-hand side.
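One way to implement this is sketched below in Python. Instead of the triangular matrix it keeps, for every column j, the set chart[j] of dotted rules together with their starting row, so an item with start i stored in chart[j] plays the role of an entry of M(i,j); this is an equivalent reformulation for λ-free grammars, and the function name earley and the dictionary encoding of the rules are our own illustrative choices.

def earley(word, rules, start):
    # rules: dictionary mapping each nonterminal to a list of right-hand
    # sides, each right-hand side given as a tuple of symbols.
    # An item (A, rhs, d, i) in chart[j] corresponds to the dotted rule
    # A -> rhs[:d] . rhs[d:] placed in the cell M(i,j) of the text.
    n = len(word)
    chart = [set() for _ in range(n + 1)]
    # Step 1: put S -> .q into M(0,0) for every rule S -> q.
    for rhs in rules[start]:
        chart[0].add((start, rhs, 0, 0))
    for j in range(n + 1):
        agenda = list(chart[j])
        while agenda:
            A, rhs, d, i = agenda.pop()
            if d < len(rhs) and rhs[d] in rules:
                # Prediction (step 2): a dot standing before the
                # nonterminal B puts B -> .q into M(j,j).
                B = rhs[d]
                for q in rules[B]:
                    item = (B, q, 0, j)
                    if item not in chart[j]:
                        chart[j].add(item)
                        agenda.append(item)
            elif d == len(rhs):
                # Completion (step 5): A -> q. in M(i,j) advances the dot
                # of every B -> r.At in M(k,i), giving B -> rA.t in M(k,j).
                for (B, rhs2, d2, k) in list(chart[i]):
                    if d2 < len(rhs2) and rhs2[d2] == A:
                        item = (B, rhs2, d2 + 1, k)
                        if item not in chart[j]:
                            chart[j].add(item)
                            agenda.append(item)
        if j < n:
            # Scanning (step 4): a dot standing before the next letter of
            # the word moves one position to the right, from column j to j+1.
            for (A, rhs, d, i) in chart[j]:
                if d < len(rhs) and rhs[d] == word[j]:
                    chart[j + 1].add((A, rhs, d + 1, i))
    # Accept if and only if some S -> q. lies in M(0,n).
    return any(A == start and i == 0 and d == len(rhs)
               for (A, rhs, d, i) in chart[n])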
Example 46. In this example, we have a λ-free context-free grammar G, and we have to decide if the word a+a*a can be generated by this grammar.
G = ({S, A, B}, {a,+,*,(,)}, S, P) P = { S → S+A, A → A*B, B → (S), S → A, A → B, B → a }
As you can see, the top right cell M(0,5) contains a rule whose left-hand side is the start symbol S and which has a dot at the end of its right-hand side, so a+a*a ∈ L(G).
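The grammar of Example 46 can also be handed to the earley sketch given after the algorithm, using our own encoding of the rules:

# Grammar of Example 46, encoded for the earley sketch above.
rules = {
    "S": [("S", "+", "A"), ("A",)],
    "A": [("A", "*", "B"), ("B",)],
    "B": [("(", "S", ")"), ("a",)],
}
print(earley("a+a*a", rules, "S"))   # prints True, so a+a*a is in L(G)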
Exercise 62. Use the Earley algorithm to decide if the word 100110 can be generated by grammar G.
G = ({S, A, B}, {0,1}, S, P) P = { S → 0A1, S → 1B0, A → B1, B → S1, A → 0, B → 1 }
Exercise 63. Use the Earley algorithm to decide if the word bbabb can be generated by grammar G.
G = ({S, A, B}, {a,b}, S, P) P = { S → BA, B → bAB, A → BAb, B → SbA, A → a, B → b }