3.3.4 SLR(1), LR(1), and LALR(1) Grammars

Next: 3.3.5 Practical Considerations for Up: 3.3 Bottom-up Parsing Previous: 3.3.3 Table Construction Contents

3.3.4 SLR(1), LR(1), and LALR(1) Grammars

Here is an example of a grammar that is not LR(0):

1)  S ::= E $
2)  E ::= E + T
3)      | T
4)  T ::= T * F
5)      | F
6)  F ::= id
7)     | ( E )

Let's focus on two item sets only:

I0:  S ::= . E $                  I1:  E ::= T .
     E ::= . E + T                     T ::= T . * F
     E ::= . T                    
     T ::= . T * F
     T ::= . F
     F ::= . id
     F ::= . ( E )

State I1 has a shift-reduce conflict. Another example is:

S ::= X
X ::= Y
    | id
Y ::= id

which includes the following two item sets:

I0:  S ::= . X $
     X ::= . Y             I1:  X ::= id .
     X ::= . id                 Y ::= id .
     Y ::= . id

State I1 has a reduce-reduce conflict.

Can we find an easy fix for the reduce-reduce and the shift-reduce conflicts? Consider the state I1 of the first example above. The FOLLOW of E is {$,+,)}. We can see that * is not in the FOLLOW[E], which means that if we could see the next token (called the lookahead token) and this token is *, then we can use the item T ::= T . * F to do a shift; otherwise, if the lookahead is one of {$,+,)}, we reduce by E ::= T. The same trick can be used in the case of a reduce-reduce conflict. For example, if we have a state with the items A ::= a . and B ::= b ., then if FOLLOW[A] doesn't overlap with FOLLOW[B], then we can deduce which production to reduce by inspecting the lookahead token. This grammar (where we can look one token ahead) is called SLR(1) and is more powerful than LR(0).

The previous trick is not always possible. Consider the grammar:

S ::= E $
E ::= L = R
    | R
L ::= * R
    | id
R ::= L

for C-style variable assignments. Consider the two states:

I0:  S ::= . E $
     E ::= . L = R             I1:  E ::= L . = R
     E ::= . R                      R ::= L .
     L ::= . * R
     L ::= . id
     R ::= . L

we have a shift-reduce conflict in I1. The FOLLOW of R is the union of the FOLLOW of E and the FOLLOW of L, ie, it is {$,=}. In this case, = is a member of the FOLLOW of R, so we can't decide shift or reduce using just one lookahead.

An even more powerful grammar is LR(1), described below. This grammar is not used in practice because of the large number of states it generates. A simplified version of this grammar, called LALR(1), has the same number of states as LR(0) but it is far more powerful than LR(0) (but less powerful than LR(1)). It is the most important grammar from all grammars we learned so far. CUP, Bison, and Yacc recognize LALR(1) grammars.

Both LR(1) and LALR(1) check one lookahead token (they read one token ahead from the input stream - in addition to the current token). An item used in LR(1) and LALR(1) is like an LR(0) item but with the addition of a set of expected lookaheads which indicate what lookahead tokens would make us perform a reduction when we are ready to reduce using this production rule. The expected lookahead symbols for a rule X ::= a are always a subset or equal to FOLLOW[X]. The idea is that an item in an itemset represents a potential for a reduction using the rule associated with the item. But each itemset (ie. state) can be reached after a number of transitions in the state diagram, which means that each itemset has an implicit context, which, in turn, implies that there are only few terminals permitted to appear in the input stream after reducing by this rule. In SLR(1), we made the assumption that the followup tokens after the reduction by X ::= a are exactly equal to FOLLOW[X]. But this is too conservative and may not help us resolve the conflicts. So the idea is to restrict this set of followup tokens by making a more careful analysis by taking into account the context in which the item appears. This will reduce the possibility of overlapping in sift/reduce or reduce/reduce conflict states. For example,

L ::= * . R, =$

indicates that we have the item L ::= * . R and that we have parsed * and will reduce by L ::= * R only when we parse R and the next lookahead token is either = or $ (end of file). It is very important to note that the FOLLOW of L (equal to {$,=}) is always a superset or equal to the expected lookaheads. The point of propagating expected lookaheads to the items is that we want to restrict the FOLLOW symbols so that only the relevant FOLLOW symbols would affect our decision when to shift or reduce in the case of a shift-reduce or reduce-reduce conflict. For example, if we have a state with two items A ::= a ., s1 and B ::= b ., s2, where s1 and s2 are sets of terminals, then if s1 and s2 are not overlapping, this is not a reduce-reduce error any more, since we can decide by inspecting the lookahead token.

So, when we construct the item sets, we also propagate the expected lookaheads. When we have a transition from A ::= a . s b by a symbol s, we just propagate the expected lookaheads. The tricky part is when we construct the closures of the items. Recall that when we have an item A ::= a . B b, where B is a nonterminal, then we add all rules B ::= . c in the item set. If A ::= a . B b has an expected lookahead t, then B ::= . c has all the elements of FIRST[bt] as expected lookaheads.

For example, consider the previous grammar:

S ::= E $
E ::= L = R
    | R
L ::= * R
    | id
R ::= L

Two of the LR(1) item sets are:

I0:  S ::= . E $     ?
     E ::= . L = R   $          I1:  E ::= L . = R   $
     E ::= . R       $               R ::= L .       $
     L ::= . * R     =$
     L ::= . id      =$
     R ::= . L       $

We always start with expected lookahead ? for the rule S ::= E $, which basically indicates that we don't care what is after end-of-file. The closure of L will contain both = and $ for expected lookaheads because in E ::= . L = R the first symbol after L is =, and in R ::= . L (the closure of E ::= . R) we use the $ terminal for expected lookahead propagated from E ::= . R since there is no symbol after L. We can see that there is no shift reduce error in I1: if the lookahead token is $ we reduce, otherwise we shift (for =).

In LR(1) parsing, an item A ::= a, s1 is different from A ::= a, s2 if s1 is different from s2. This results to a large number of states since the combinations of expected lookahead symbols can be very large. To reduce the number of states, when we have two items like those two, instead of creating two states (one for each item), we combine the two states into one by creating an item A := a, s3, where s3 is the union of s1 and s2. Since we make the expected lookahead sets larger, there is a danger that some conflicts will have worse chances to be resolved. But the number of states we get is the same as that for LR(0). This grammar is called LALR(1).

There is an easier way to construct the LALR(1) item sets. Simply start by constructing the LR(0) item sets. Then we add the expected lookaheads as follows: whenever we find a new lookahead using the closure law given above we add it in the proper item; when we propagate lookaheads we propagate all of them. Sometimes when we insert a new lookahead we need to do all the lookahead propagations again for the new lookahead even in the case we have already constructed the new items. So we may have to loop through the items many times until no new lookaheads are generated. Sounds hard? Well that's why there are parser generators to do that automatically for you. This is how CUP work.

Next: 3.3.5 Practical Considerations for Up: 3.3 Bottom-up Parsing Previous: 3.3.3 Table Construction Contents

2015-01-20