Here is an example of a grammar that is not LR(0):
1) S ::= E $ 2) E ::= E + T 3) | T 4) T ::= T * F 5) | F 6) F ::= id 7) | ( E )
Let's focus on two item sets only:
I0: S ::= . E $ I1: E ::= T . E ::= . E + T T ::= T . * F E ::= . T T ::= . T * F T ::= . F F ::= . id F ::= . ( E )
State I1
has a shift-reduce conflict.
Another example is:
S ::= X X ::= Y | id Y ::= idwhich includes the following two item sets:
I0: S ::= . X $ X ::= . Y I1: X ::= id . X ::= . id Y ::= id . Y ::= . id
State I1
has a reduce-reduce conflict.
Can we find an easy fix for the reduce-reduce and the shift-reduce
conflicts? Consider the state I1
of the first example
above. The FOLLOW of E
is {$,+,)}
. We can see that
*
is not in the FOLLOW[E]
, which means that if we could
see the next token (called the lookahead token) and this token
is *
, then we can use the item T ::= T . * F
to do a
shift; otherwise, if the lookahead is one of {$,+,)}
, we reduce
by E ::= T
. The same trick can be used in the case of a
reduce-reduce conflict. For example, if we have a state with the
items A ::= a .
and B ::= b .
, then if FOLLOW[A]
doesn't overlap with FOLLOW[B]
, then we can deduce which
production to reduce by inspecting the lookahead token. This grammar
(where we can look one token ahead) is called SLR(1) and is more
powerful than LR(0).
The previous trick is not always possible. Consider the grammar:
S ::= E $ E ::= L = R | R L ::= * R | id R ::= Lfor C-style variable assignments. Consider the two states:
I0: S ::= . E $ E ::= . L = R I1: E ::= L . = R E ::= . R R ::= L . L ::= . * R L ::= . id R ::= . Lwe have a shift-reduce conflict in
I1
. The FOLLOW of R
is the union of the FOLLOW of E
and the FOLLOW of L
, ie, it is
{$,=}
. In this case, =
is a member of the FOLLOW of
R
, so we can't decide shift or reduce using just one lookahead.
An even more powerful grammar is LR(1), described below. This grammar is not used in practice because of the large number of states it generates. A simplified version of this grammar, called LALR(1), has the same number of states as LR(0) but it is far more powerful than LR(0) (but less powerful than LR(1)). It is the most important grammar from all grammars we learned so far. CUP, Bison, and Yacc recognize LALR(1) grammars.
Both LR(1) and LALR(1) check one lookahead token (they read one token
ahead from the input stream - in addition to the current token). An
item used in LR(1) and LALR(1) is like an LR(0) item but with the
addition of a set of expected lookaheads which indicate what lookahead
tokens would make us perform a reduction when we are ready to reduce
using this production rule. The expected lookahead symbols for a rule
X ::= a
are always a subset or equal to FOLLOW[X]
. The
idea is that an item in an itemset represents a potential for a
reduction using the rule associated with the item. But each itemset
(ie. state) can be reached after a number of transitions in the state
diagram, which means that each itemset has an implicit context, which,
in turn, implies that there are only few terminals permitted to appear
in the input stream after reducing by this rule. In SLR(1), we made
the assumption that the followup tokens after the reduction by
X ::= a
are exactly equal to FOLLOW[X]
. But this is too
conservative and may not help us resolve the conflicts. So the idea is
to restrict this set of followup tokens by making a more careful
analysis by taking into account the context in which the item
appears. This will reduce the possibility of overlapping in
sift/reduce or reduce/reduce conflict states. For example,
L ::= * . R, =$indicates that we have the item
L ::= * . R
and that we have parsed *
and will reduce by L ::= * R
only
when we parse R
and the next lookahead token
is either =
or $
(end of file). It is very important to
note that the FOLLOW of L
(equal to {$,=}
) is always a
superset or equal to the expected lookaheads. The point of propagating
expected lookaheads to the items is that we want to restrict the FOLLOW
symbols so that only the relevant FOLLOW symbols would affect our
decision when to shift or reduce in the case of a shift-reduce or
reduce-reduce conflict. For example, if we have a state with two items
A ::= a ., s1
and B ::= b ., s2
, where s1
and
s2
are sets of terminals, then if s1
and s2
are
not overlapping, this is not a reduce-reduce error any more, since we
can decide by inspecting the lookahead token.
So, when we construct the item sets, we also propagate the expected
lookaheads. When we have a transition from A ::= a . s b
by a
symbol s
, we just propagate the expected lookaheads. The
tricky part is when we construct the closures of the items. Recall
that when we have an item A ::= a . B b
, where B
is a
nonterminal, then we add all rules B ::= . c
in the item
set. If A ::= a . B b
has an expected lookahead t
, then
B ::= . c
has all the elements of FIRST[bt]
as expected
lookaheads.
For example, consider the previous grammar:
S ::= E $ E ::= L = R | R L ::= * R | id R ::= L
Two of the LR(1) item sets are:
I0: S ::= . E $ ? E ::= . L = R $ I1: E ::= L . = R $ E ::= . R $ R ::= L . $ L ::= . * R =$ L ::= . id =$ R ::= . L $
We always start with expected lookahead ?
for the rule
S ::= E $
, which basically indicates that we don't care what is
after end-of-file. The closure of L
will contain both =
and $
for expected lookaheads because in E ::= . L = R
the first symbol after L
is =
, and in R ::= . L
(the closure of E ::= . R
) we use the $
terminal for
expected lookahead propagated from E ::= . R
since there is no
symbol after L
. We can see that there is no shift reduce error
in I1
: if the lookahead token is $
we reduce, otherwise
we shift (for =
).
In LR(1) parsing, an item A ::= a, s1
is different from
A ::= a, s2
if s1
is different from s2
. This
results to a large number of states since the combinations of expected
lookahead symbols can be very large. To reduce the number of states,
when we have two items like those two, instead of creating two states
(one for each item), we combine the two states into one by creating an
item A := a, s3
, where s3
is the union of s1
and
s2
. Since we make the expected lookahead sets larger, there is
a danger that some conflicts will have worse chances to be resolved.
But the number of states we get is the same as that for LR(0). This
grammar is called LALR(1).
There is an easier way to construct the LALR(1) item sets. Simply start by constructing the LR(0) item sets. Then we add the expected lookaheads as follows: whenever we find a new lookahead using the closure law given above we add it in the proper item; when we propagate lookaheads we propagate all of them. Sometimes when we insert a new lookahead we need to do all the lookahead propagations again for the new lookahead even in the case we have already constructed the new items. So we may have to loop through the items many times until no new lookaheads are generated. Sounds hard? Well that's why there are parser generators to do that automatically for you. This is how CUP work.