Regular expressions are a very convenient form of representing
(possibly infinite) sets of strings, called *regular sets*. For
example, the RE (*a*| *b*)**aa* represents the infinite set
{``*aa''*,``*aaa''*,``*baa''*,``*abaa''*,^{ ... }}, which is the set of all
strings with characters *a* and *b* that end in *aa*. Formally, a RE
is one of the following (along with the set of strings it
designates):

name | RE | designation |

epsilon | {``''} | |

symbol | a |
{``a''} for some character a |

concatenation | AB |
the set
{ rs| r , s }, where
rs is string concatenation, and and are the designations of of the REs A and B |

alternation | A| B |
the set
, where and are the designations of the REs A and B |

repetition or Kleen closure | A^{*} |
the set
| A|(AA)|(AAA)|^{ ... } (an infinite set) |

For example, (*a*| *b*)*c* designates the set
{ *rs*| *r* ({``*a''*} {``*b''*}), *s* {``*c''*} },
which is equal to
{``*ac''*,``*bc''*}.

We will use the following notational conveniences:

P^{+} |
= | PP^{*} |

P? |
= | P | |

[a - z] |
= | a| b| c|^{ ... }| z |

We can freely put parentheses around REs to denote the order of
evaluation. For example, (*a*| *b*)*c*. To avoid using many parentheses,
we use the following rules: concatenation and alternation are
associative (ie, *ABC* means (*AB*)*C* and is equivalent to *A*(*BC*)),
alternation is commutative (ie, *A*| *B* = *B*| *A*), repetition is
idempotent (ie,
*A*^{**} = *A*^{*}), and concatenation distributes
over alternation (eg,
(*a*| *b*)*c* = *ac*| *bc*).

For convenience, we can give names to REs so we can refer to them by their name. For example:

for |
= | for |

letter |
= | [a - zA - Z] |

digit |
= | [0 - 9] |

identifier |
= | letter (letter | digit)^{*} |

sign |
= | + | - | |

integer |
= | sign (0 | [1 - 9]digit^{*}) |

decimal |
= | integer . digit^{*} |

real |
= | (integer | decimal ) E sign digit^{+} |

There is some ambiguity though: If the input includes the characters
`for8`

, then the first rule (for *for_keyword*) matches 3
characters (`for`

), the fourth rule (for *identifier*) can
match 1, 2, 3, or 4 characters, the longest being `for8`

. To
resolve this type of ambiguities, when there is a choice of rules,
scanners choose the one that matches the maximum number of
characters. In this case, the chosen rule is the one for *identifier* that matches 4 characters. This disambiguation rule is
called the *longest match rule*. If there more than one rules that
match the same maximum number of characters, the rule listed first is
chosen. This is the *rule priority* disambiguation rule. For
example, the lexical word `for`

is taken as a *for_keyword*
even though it uses the same number of characters as an identifier.

2002-08-26