3.3.5 Practical Considerations for LALR(1) Grammars

Next: 3.4 Case Study: The Up: 3.3 Bottom-up Parsing Previous: 3.3.4 SLR(1), LR(1), and Contents

3.3.5 Practical Considerations for LALR(1) Grammars

Most parsers nowdays are specified using LALR(1) grammars fed to a parser generator, such as CUP. There are few tricks that we need to know to avoid reduce-reduce and shift-reduce conflicts.

When we learned about top-down predictive parsing, we were told that left recursion is bad. For LALR(1) the opposite is true: left recursion is good, right recursion is bad. Consider this:

L ::= id , L
    | id

which captures a list of ids separated by commas. If the FOLLOW of L (or to be more precise, the expected lookaheads for L ::= id) contains the comma token, we are in big trouble. We can use instead:

L ::= L , id
    | id

Left recursion uses less stack than right recursion and it also produces left associative trees (which is what we usually want).

Some shift-reduce conflicts are very difficult to eliminate. Consider the infamous if-then-else conflict:

S ::= if E then S else S
    | if E then S
    | ...

Read Section 3.3 in the textbook to see how this can be eliminated.

Most shift-reduce errors though are easy to remove by assigning precedence and associativity to operators. Consider the grammar:

S ::= E $
E ::= E + E
    | E * E
    | ( E )
    | id
    | num

and four of its LALR(1) states:

I0:  S ::= . E $    ?
     E ::= . E + E  +*$    I1:  S ::= E . $    ?      I2:  E ::= E * . E   +*$
     E ::= . E * E  +*$         E ::= E . + E  +*$         E ::= . E + E   +*$
     E ::= . ( E )  +*$         E ::= E . * E  +*$         E ::= . E * E   +*$
     E ::= . id     +*$                                    E ::= . ( E )   +*$
     E ::= . num    +*$    I3:  E ::= E * E .   +*$        E ::= . id      +*$
                                E ::= E . + E   +*$        E ::= . num     +*$
                                E ::= E . * E   +*$

Here we have a shift-reduce error. Consider the first two items in I3. If we have a*b+c and we parsed a*b, do we reduce using E ::= E * E or do we shift more symbols? In the former case we get a parse tree (a*b)+c; in the latter case we get a*(b+c). To resolve this conflict, we can specify that * has higher precedence than +. The precedence of a grammar production is equal to the precedence of the rightmost token at the rhs of the production. For example, the precedence of the production E ::= E * E is equal to the precedence of the operator *, the precedence of the production E ::= ( E ) is equal to the precedence of the token ), and the precedence of the production E ::= if E then E else E is equal to the precedence of the token else. The idea is that if the lookahead has higher precedence than the production currently used, we shift. For example, if we are parsing E + E using the production rule E ::= E + E and the lookahead is *, we shift *. If the lookahead has the same precedence as that of the current production and is left associative, we reduce, otherwise we shift. The above grammar is valid if we define the precedence and associativity of all the operators. Thus, it is very important when you write a parser using CUP or any other LALR(1) parser generator to specify associativities and precedences for most tokens (especially for those used as operators). Note: you can explicitly define the precedence of a rule in CUP using the %prec directive:

 E ::= MINUS E   %prec UMINUS

where UMINUS is a pseudo-token that has higher precedence than TIMES, MINUS etc, so that -1*2 is equal to (-1)*2, not to -(1*2).

Another thing we can do when specifying an LALR(1) grammar for a parser generator is error recovery. All the entries in the ACTION and GOTO tables that have no content correspond to syntax errors. The simplest thing to do in case of error is to report it and stop the parsing. But we would like to continue parsing finding more errors. This is called error recovery. Consider the grammar:

S ::= L = E ;
    | { SL } ;
    | error ;
SL ::= S ;
     | SL S ;

The special token error indicates to the parser what to do in case of invalid syntax for S (an invalid statement). In this case, it reads all the tokens from the input stream until it finds the first semicolon. The way the parser handles this is to first push an error state in the stack. In case of an error, the parser pops out elements from the stack until it finds an error state where it can proceed. Then it discards tokens from the input until a restart is possible. Inserting error handling productions in the proper places in a grammar to do good error recovery is considered very hard.

Next: 3.4 Case Study: The Up: 3.3 Bottom-up Parsing Previous: 3.3.4 SLR(1), LR(1), and Contents

2015-01-20