Consider the following input string:

x+2*yWhen scanned by a scanner, it produces the following stream of tokens:

id(x) + num(2) * id(y)The goal is to parse this expression and construct a data structure (a parse tree) to represent it. One possible syntax for expressions is given by the following grammar G1:

E ::= E + T | E - T | T T ::= T * F | T / F | F F ::= num | idwhere

`E, T`

and `F`

stand for expression, term, and factor respectively.
For example, the rule for `E`

indicates that an expression `E`

can take one of the following 3 forms: an expression followed by the token
`+`

followed by a term, or an expression followed by the token
`-`

followed by a term, or simply a term. The first rule for `E`

is actually
a shorthand of 3 productions:
E ::= E + T E ::= E - T E ::= TG1 is an example of a context-free grammar (defined below); the symbols

`E, T`

and `F`

are `+, -, *, ?, num`

, and
`id`

are `E`

is the
In general, a *context-free grammar* (CFG) has a finite set of terminals (tokens),
a finite set of nonterminals from which one is the start symbol,
and a finite set of *productions* of the form:

A ::= X1 X2 ... Xnwhere

`A`

is a nonterminal and each `Xi`

is either a terminal or nonterminal symbol.
Given two sequences of symbols *a* and *b* (can be any combination of
terminals and nonterminals) and a production
*A* : : = *X*_{1}*X*_{2}...*X*_{n},
the form
*aAb* = > *aX*_{1}*X*_{2}...*X*_{n}*b* is called a *derivation*. That
is, the nonterminal symbol *A* is replaced by the rhs
(right-hand-side) of the production for *A*. For example,

T / F + 1 - x => T * F / F + 1 - xis a derivation since we used the production

`T := T * F`

.
*Top-down parsing* starts from the start symbol of the grammar *S* and
applies derivations until the entire input string is derived (ie, a
sequence of terminals that matches the input tokens). For example,

E => E + T => E + T * F => T + T * F => T + F * F => T + num * F => F + num * F => id + num * F => id + num * idwhich matches the input sequence

`id(x) + num(2) * id(y)`

. Top
down parsing occasionally requires backtracking. For example, suppose
the we used the derivation `E => E - T`

instead of the first
derivation. Then, later we would have to backtrack because the derived
symbols will not match the input tokens. This is true for all
nonterminals that have more than one production since it indicates
that there is a choice of which production to use. We will learn how
to construct parsers for many types of CFGs that never backtrack. These
parsers are based on a method called `T + F * F`

we have 3 choices: we can use
a derivation for `T`

, for the first `F`

, or for the second
`F`

. When we always replace the leftmost nonterminal, it is
called
In contrast to top-down parsing, *bottom-up parsing* starts from
the input string and uses derivations in the opposite directions (ie,
by replacing the rhs sequence
*X*_{1}*X*_{2}...*X*_{n} of a production
*A* : : = *X*_{1}*X*_{2}...*X*_{n} with the nonterminal *A*. It stops when it derives the
start symbol. For example,

id(x) + num(2) * id(y) => id(x) + num(2) * F => id(x) + F * F => id(x) + T * F => id(x) + T => F + T => T + T => E + T => E

The *parse tree* of an input sequence according to a CFG is the
tree of derivations. For example if we used a production
*A* : : = *X*_{1}*X*_{2}...*X*_{n} (in either top-down or bottom-up parsing) then we
construct a tree with node *A* and children
*X*_{1}*X*_{2}...*X*_{n}.
For example, the parse tree of `id(x) + num(2) * id(y)`

is:

E / | \ E + T | / | \ T T * F | | | F F id | | id numSo a parse tree has non-terminals for internal nodes and terminals for leaves. As another example, consider the following grammar:

S ::= ( L ) | a L ::= L , S | SUnder this grammar, the parse tree of the sentence (

S / | \ ( L ) / | \ L , S | / | \ S ( L ) | / | \ a L , S | | S a / | \ ( L ) / | \ L , S | | S a | a

Consider now the following grammar G2:

E ::= T + E | T - E | T T ::= F * T | F / T | F F ::= num | idThis is similar to our original grammar, but it is right associative when the leftmost derivation rules is used. That is,

`x-y-z`

is equivalent to `x-(y-z)`

under G2, as we can see from
its parse tree.
Consider now the following grammar G3:

E ::= E + E | E - E | E * E | E / E | num | idIs this grammar equivalent to our original grammar G1? Well, it recognizes the same language, but it constructs the wrong parse trees. For example, the

`x+y*z`

is interpreted as `(x+y)*z`

by this
grammar (if we use leftmost derivations) and as `x+(y*z)`

by G1
or G2. That is, both G1 and G2 grammar handle the operator
precedence correctly (since * has higher precedence than +), while
the G3 grammar does not.
In general, to write a grammar that handles
precedence properly, we can start with a grammar that does not handle
precedence at all, such as our last grammar G3, and then we can refine
it by creating more nonterminals, one for each group of operators that
have the same precedence. For example, suppose we want to parse
an expression *E* and we have 4 groups of operators: {*not*},
{*,/}, { + , - }, and
{*and*, *or*}, in this order of
precedence. Then we create 4 new nonterminals: *N*, *T*, *F*, and *B*
and we split the derivations for *E* into 5 groups of derivations (the
same way we split the rules for *E* in the last grammar into 3 groups
in the first grammar).

A grammar is *ambiguous* if it has more than one parse tree for
the same input sequence depending which derivations are applied each
time. For example, the grammar G3 is ambiguous since it has two
parse trees for `x-y-z`

(one parses `x-y`

first, while the
other parses `y-z`

first). Of course, the first one is the right
interpretation since - is left associative. Fortunately, if we
always apply the leftmost derivation rule, we will never derive the
second parse tree. So in this case the leftmost derivation rule
removes the ambiguity.

2002-08-26