Regular expressions are a very convenient form of representing
(possibly infinite) sets of strings, called *regular sets*. For
example, the RE (*a*| *b*)**aa* represents the infinite set
{``*aa*",``*aaa*",``*baa*",``*abaa*",^{ ... }}, which is the set of all
strings with characters *a* and *b* that end in *aa*. Formally, a RE
is one of the following (along with the set of strings it
designates):

name | RE | designation |

epsilon | {``"} | |

symbol | a |
{``a"} for some character a |

concatenation | AB |
the set
{ rs| r , s }, where
rs is string |

concatenation, and and designate the REs A and B |
||

alternation | A| B |
the set
, where and designate the REs A and B |

repetition | A^{*} |
the set
| A|(AA)|(AAA)|^{ ... } (an infinite set) |

Repetition is also called Kleen closure.
For example, (*a*| *b*)*c* designates the set
{ *rs*| *r* ({``*a*"} {``*b*"}), *s* {``*c*"} },
which is equal to
{``*ac*",``*bc*"}.

We will use the following notational conveniences:

P^{+} |
= | PP^{*} |

P? |
= | P | |

[a - z] |
= | a| b| c|^{ ... }| z |

We can freely put parentheses around REs to denote the order of
evaluation. For example, (*a*| *b*)*c*. To avoid using many parentheses,
we use the following rules: concatenation and alternation are
associative (ie, *ABC* means (*AB*)*C* and is equivalent to *A*(*BC*)),
alternation is commutative (ie, *A*| *B* = *B*| *A*), repetition is
idempotent (ie,
*A*^{**} = *A*^{*}), and concatenation distributes
over alternation (eg,
(*a*| *b*)*c* = *ac*| *bc*).

For convenience, we can give names to REs so we can refer to them by their name. For example:

for - keyword |
= | for |

letter |
= | [a - zA - Z] |

digit |
= | [0 - 9] |

identifier |
= | letter (letter | digit)^{*} |

sign |
= | + | - | |

integer |
= | sign (0 | [1 - 9]digit^{*}) |

decimal |
= | integer . digit^{*} |

real |
= | (integer | decimal ) E sign digit^{+} |

There is some ambiguity though: If the input includes the characters
`for8`

, then the first rule (for *for-keyword*) matches 3
characters (`for`

), the fourth rule (for *identifier*) can
match 1, 2, 3, or 4 characters, the longest being `for8`

. To
resolve this type of ambiguities, when there is a choice of rules,
scanner generators choose the one that matches the maximum number of
characters. In this case, the chosen rule is the one for *identifier* that matches 4 characters (`for8`

). This
disambiguation rule is called the *longest match rule*. If there
are more than one rules that match the same maximum number of
characters, the rule listed first is chosen. This is the *rule
priority* disambiguation rule. For example, the lexical word
`for`

is taken as a *for-keyword* even though it uses the
same number of characters as an identifier.