Most parsers nowdays are specified using LALR(1) grammars fed to a parser generator, such as CUP. There are few tricks that we need to know to avoid reduce-reduce and shift-reduce conflicts.
When we learned about top-down predictive parsing, we were told that left recursion is bad. For LALR(1) the opposite is true: left recursion is good, right recursion is bad. Consider this:
L ::= id , L | idwhich captures a list of ids separated by commas. If the FOLLOW of
L
(or to be more precise, the expected lookaheads for L ::= id
) contains
the comma token, we are in big trouble. We can use instead:
L ::= L , id | id
Left recursion uses less stack than right recursion and it also produces left associative trees (which is what we usually want).
Some shift-reduce conflicts are very difficult to eliminate. Consider the infamous if-then-else conflict:
S ::= if E then S else S | if E then S | ...
Read Section 3.3 in the textbook to see how this can be eliminated.
Most shift-reduce errors though are easy to remove by assigning precedence and associativity to operators. Consider the grammar:
S ::= E $ E ::= E + E | E * E | ( E ) | id | numand four of its LALR(1) states:
I0: S ::= . E $ ? E ::= . E + E +*$ I1: S ::= E . $ ? I2: E ::= E * . E +*$ E ::= . E * E +*$ E ::= E . + E +*$ E ::= . E + E +*$ E ::= . ( E ) +*$ E ::= E . * E +*$ E ::= . E * E +*$ E ::= . id +*$ E ::= . ( E ) +*$ E ::= . num +*$ I3: E ::= E * E . +*$ E ::= . id +*$ E ::= E . + E +*$ E ::= . num +*$ E ::= E . * E +*$
Here we have a shift-reduce error. Consider the first two items in
I3
. If we have a*b+c
and we parsed a*b
, do we
reduce using E ::= E * E
or do we shift more symbols? In the
former case we get a parse tree (a*b)+c
; in the latter case we
get a*(b+c)
. To resolve this conflict, we can specify that
*
has higher precedence than +
. The precedence of a
grammar production is equal to the precedence of the rightmost token
at the rhs of the production. For example, the precedence of the
production E ::= E * E
is equal to the precedence of the
operator *
, the precedence of the production E ::= ( E )
is equal to the precedence of the token )
, and the precedence
of the production E ::= if E then E else E
is equal to the
precedence of the token else
. The idea is that if the
lookahead has higher precedence than the production currently used, we
shift. For example, if we are parsing E + E
using the
production rule E ::= E + E
and the lookahead is *
, we
shift *
. If the lookahead has the same precedence as that of
the current production and is left associative, we reduce, otherwise
we shift. The above grammar is valid if we define the precedence and
associativity of all the operators. Thus, it is very important when
you write a parser using CUP or any other LALR(1) parser generator to
specify associativities and precedences for most tokens (especially
for those used as operators). Note: you can explicitly define
the precedence of a rule in CUP using the %prec
directive:
E ::= MINUS E %prec UMINUSwhere
UMINUS
is a pseudo-token that has higher precedence than
TIMES
, MINUS
etc, so that -1*2
is equal to
(-1)*2
, not to -(1*2)
.
Another thing we can do when specifying an LALR(1) grammar for a parser generator is error recovery. All the entries in the ACTION and GOTO tables that have no content correspond to syntax errors. The simplest thing to do in case of error is to report it and stop the parsing. But we would like to continue parsing finding more errors. This is called error recovery. Consider the grammar:
S ::= L = E ; | { SL } ; | error ; SL ::= S ; | SL S ;
The special token error
indicates to the parser what to do in
case of invalid syntax for S
(an invalid statement). In this
case, it reads all the tokens from the input stream until it finds the
first semicolon. The way the parser handles this is to first push an
error
state in the stack. In case of an error, the parser pops
out elements from the stack until it finds an error state where it can
proceed. Then it discards tokens from the input until a restart is
possible. Inserting error handling productions in the proper places
in a grammar to do good error recovery is considered very hard.