next up previous contents
Next: 3.2.1 Recursive Descent Parsing Up: 3 Parsing Previous: 3.1 Context-free Grammars

3.2 Predictive Parsing

The goal of predictive parsing is to construct a top-down parser that never backtracks. To do so, we must transform a grammar in two ways:

  1. eliminate left recursion, and
  2. perform left factoring.
These rules eliminate most common causes for backtracking although they do not guarantee a completely backtrack-free parsing (called LL(1) as we will see later).

Consider this grammar:

A ::= A a
    | b
It recognizes the regular expression ba*. The problem is that if we use the first production for top-down derivation, we will fall into an infinite derivation chain. This is called left recursion. But how else can you express ba*? Here is an alternative way:
A ::= b A'
A' ::= a A'
where the third production is an empty production (ie, it is A' ::= ). That is, A' parses the RE a*. Even though this CFG is recursive, it is not left recursive. In general, for each nonterminal X, we partition the productions for X into two groups: one that contains the left recursive productions, and the other with the rest. Suppose that the first group is:
X ::= X a1
X ::= X an
while the second group is:
X ::= b1
X ::= bm
where a, b are symbol sequences. Then we eliminate the left recursion by rewriting these rules into:
X ::= b1 X'
X ::= bm X'
X' ::= a1 X'
X' ::= an X'
X' ::=
For example, the CFG G1 is transformed into:
E ::= T E'
E' ::= + T E'
     | - T E'
T ::= F T'
T' ::= * F T'
     | / F T'
F ::= num
    | id

Suppose now that we have a number of productions for X that have a common prefix in their rhs (but without any left recursion):

X ::= a b1
X ::= a bn
We factor out the common prefix as follows:
X ::= a X'
X' ::= b1
X' ::= bn
This is called left factoring and it helps predict which rule to use without backtracking. For example, the rule from our right associative grammar G2:
E ::= T + E
    | T - E
    | T
is translated into:
E ::= T E'
E' ::= + E
     | - E

As another example, let L be the language of all regular expressions over the alphabet $ \Sigma$ = {a, b}. That is, L = {``$\varepsilon$",``a",``b",``a*",``b*",``a| b",``(a| b)",...}. For example, the string ``a(a| b)*| b*" is a member of L. There is no RE that captures the syntax of all REs. Consider for example the RE (( ... (a) ... )), which is equivalent to the language (na)n for all n. This represents a valid RE but there is no RE that can capture its syntax. A context-free grammar that recognizes L is:

R : : = R R
  | R ``|" R
  | R *
  | R )
  | a
  | b
  | ``$\displaystyle \varepsilon$"

after elimination of left recursion, this grammar becomes:

R : : = R ) R'
  | a R'
  | b R'
  | ``$\displaystyle \varepsilon$R'
R' : : = R R'
  | ``|" R R'
  | R'

next up previous contents
Next: 3.2.1 Recursive Descent Parsing Up: 3 Parsing Previous: 3.1 Context-free Grammars
Leonidas Fegaras