The goal of predictive parsing is to construct a top-down parser that never backtracks. To do so, we must transform a grammar in two ways:
Consider this grammar:
A ::= A a | bIt recognizes the regular expression ba^{*}. The problem is that if we use the first production for top-down derivation, we will fall into an infinite derivation chain. This is called left recursion. But how else can you express ba^{*}? Here is an alternative way:
A ::= b A' A' ::= a A' |where the third production is an empty production (ie, it is
A' ::=
). That is, A' parses the RE a^{*}. Even though this
CFG is recursive, it is not left recursive. In general, for each
nonterminal X, we partition the productions for X into two groups:
one that contains the left recursive productions, and the other with
the rest. Suppose that the first group is:
X ::= X a1 ... X ::= X anwhile the second group is:
X ::= b1 ... X ::= bmwhere
a, b
are symbol sequences.
Then we eliminate the left recursion by rewriting these rules into:
X ::= b1 X' ... X ::= bm X' X' ::= a1 X' ... X' ::= an X' X' ::=For example, the CFG G1 is transformed into:
E ::= T E' E' ::= + T E' | - T E' | T ::= F T' T' ::= * F T' | / F T' | F ::= num | id
Suppose now that we have a number of productions for X that have a common prefix in their rhs (but without any left recursion):
X ::= a b1 ... X ::= a bnWe factor out the common prefix as follows:
X ::= a X' X' ::= b1 ... X' ::= bnThis is called left factoring and it helps predict which rule to use without backtracking. For example, the rule from our right associative grammar G2:
E ::= T + E | T - E | Tis translated into:
E ::= T E' E' ::= + E | - E |
As another example, let L be the language of all regular expressions over the alphabet = {a, b}. That is, L = {``",``a",``b",``a*",``b*",``a| b",``(a| b)",...}. For example, the string ``a(a| b)*| b*" is a member of L. There is no RE that captures the syntax of all REs. Consider for example the RE ((^{ ... }(a)^{ ... })), which is equivalent to the language (^{n}a)^{n} for all n. This represents a valid RE but there is no RE that can capture its syntax. A context-free grammar that recognizes L is:
R | : : = | R R |
| | R ``|" R | |
| | R * | |
| | ( R ) | |
| | a | |
| | b | |
| | ``" |
R | : : = | ( R ) R' |
| | a R' | |
| | b R' | |
| | ``" R' | |
R' | : : = | R R' |
| | ``|" R R' | |
| | * R' | |
| |