Most parsers nowdays are specified using LALR(1) grammars fed to a parser generator, such as Bison. There are few tricks that we need to know to avoid reduce-reduce and shift-reduce conflicts.
When we learned about top-down predictive parsing, we were told that left recursion is bad. For LALR(1) the opposite is true: left recursion is good, right recursion is bad. Consider this:
L ::= id , L
| id
which captures a list of ids separated by commas. If the FOLLOW of L
(or to be more precise, the expected lookaheads for L ::= id) contains
the comma token, we are in big trouble. We can use instead:
L ::= L , id
| id
Left recursion uses less stack than right recursion and it also produces left associative trees (which is what we usually want).
Some shift-reduce conflicts are very difficult to eliminate. Consider the infamous if-then-else conflict:
S ::= if E then S else S
| if E then S
| ...
Read page 68 in the textbook to see how this can be eliminated.
Most shift-reduce errors though are easy to remove by assigning precedence and associativity to operators. Consider the grammar:
S ::= E $
E ::= E + E
| E * E
| ( E )
| id
| num
and four of its LALR(1) states:
I0: S ::= . E $ ?
E ::= . E + E +*$ I1: S ::= E . $ ? I2: E ::= E * . E +*$
E ::= . E * E +*$ E ::= E . + E +*$ E ::= . E + E +*$
E ::= . ( E ) +*$ E ::= E . * E +*$ E ::= . E * E +*$
E ::= . id +*$ E ::= . ( E ) +*$
E ::= . num +*$ I3: E ::= E * E . +*$ E ::= . id +*$
E ::= E . + E +*$ E ::= . num +*$
E ::= E . * E +*$
Here we have a shift-reduce error. Consider the first two items in
I3. If we have a*b+c and we parsed a*b, do we
reduce using E ::= E * E or do we shift more symbols? In the
former case we get a parse tree (a*b)+c; in the latter case we
get a*(b+c). To resolve this conflict, we can specify that
* has higher precedence than +. The idea is that if the
lookahead has higher precedence than the operator in the production
currently used, we shift. For example, if we are parsing E + E
using the production rule E ::= E + E and the lookahead is
*, we shift *. If the lookahead has the same precedence
and is left associative, we reduce. The above grammar is LALR(1) if
we define the precedence and associativity of all the operators.
Thus, it is very important when you write a parser using Bison or
any other LALR(1) parser generator to specify associativities
and precedences for most tokens (especially for those used as operators).
Another thing to do when specifying an LALR(1) grammar for a parser generator is error recovery. All the entries in the ACTION and GOTO tables that have no content correspond to syntax errors. The simplest thing to do in case of error is to report it and stop the parsing. But we would like to continue parsing finding more errors. This is called error recovery. Consider the grammar:
S ::= L = E ;
| { SL } ;
| error ;
SL ::= S ;
| SL S ;
The special token error indicates to the parser what to do in
case of invalid syntax for S (an invalid statement). In this
case, it reads all the tokens from the input stream until it finds the
first semicolon. The way the parser handles this is to first push an
error state in the stack. In case of an error, the parser pops
out elements from the stack until it finds an error state where it can
proceed. Then it discards tokens from the input until a restart is
possible. Inserting error handling productions in the proper places
in a grammar to do good error recovery is considered very hard.