3.2 Predictive Parsing

Next: 3.2.1 Recursive Descent Parsing Up: 3 Parsing Previous: 3.1 Context-free Grammars Contents

3.2 Predictive Parsing

The goal of predictive parsing is to construct a top-down parser that never backtracks. To do so, we must transform a grammar in two ways:

eliminate left recursion, and
perform left factoring.

These rules eliminate most common causes for backtracking although they do not guarantee a completely backtrack-free parsing (called LL(1) as we will see later).

Consider this grammar:

A ::= A a
    | b

It recognizes the regular expression ba^*. The problem is that if we use the first production for top-down derivation, we will fall into an infinite derivation chain. This is called left recursion. But how else can you express ba^*? Here is an alternative way:

A ::= b A'
A' ::= a A'
     |

where the third production is an empty production (ie, it is A' ::= ). That is, A' parses the RE a^*. Even though this CFG is recursive, it is not left recursive. In general, for each nonterminal X, we partition the productions for X into two groups: one that contains the left recursive productions, and the other with the rest. Suppose that the first group is:

X ::= X a1
...
X ::= X an

while the second group is:

X ::= b1
...
X ::= bm

where a, b are symbol sequences. Then we eliminate the left recursion by rewriting these rules into:

X ::= b1 X'
...
X ::= bm X'
X' ::= a1 X'
...
X' ::= an X'
X' ::=

For example, the CFG G1 is transformed into:

E ::= T E'
E' ::= + T E'
     | - T E'
     |
T ::= F T'
T' ::= * F T'
     | / F T'
     |
F ::= num
    | id

Suppose now that we have a number of productions for X that have a common prefix in their rhs (but without any left recursion):

X ::= a b1
...
X ::= a bn

We factor out the common prefix as follows:

X ::= a X'
X' ::= b1
...
X' ::= bn

This is called left factoring and it helps predict which rule to use without backtracking. For example, the rule from our right associative grammar G2:

E ::= T + E
    | T - E
    | T

is translated into:

E ::= T E'
E' ::= + E
     | - E
     |

As another example, let L be the language of all regular expressions over the alphabet $\Sigma$ = {a, b}. That is, L = {`` $\varepsilon$ ",``a",``b",``a*",``b*",``a| b",``(a| b)",...}. For example, the string ``a(a| b)*| b*" is a member of L. There is no RE that captures the syntax of all REs. Consider for example the RE ((^...(a)^...)), which is equivalent to the language (ⁿa)ⁿ for all n. This represents a valid RE but there is no RE that can capture its syntax. A context-free grammar that recognizes L is:

R	: : =	R R
	\|	R ``\|" R
	\|	R *
	\|	( R )
	\|	a
	\|	b
	\|	`` $\displaystyle \varepsilon$ "

after elimination of left recursion, this grammar becomes:

R	: : =	( R ) R'
	\|	a R'
	\|	b R'
	\|	`` $\displaystyle \varepsilon$ " R'
R'	: : =	R R'
	\|	``\|" R R'
	\|	* R'
	\|

Subsections

Next: 3.2.1 Recursive Descent Parsing Up: 3 Parsing Previous: 3.1 Context-free Grammars Contents

2015-01-20