Regular expressions are a very convenient form of representing (possibly infinite) sets of strings, called regular sets. For example, the RE (a| b)*aa represents the infinite set {``aa'',``aaa'',``baa'',``abaa'', ... }, which is the set of all strings with characters a and b that end in aa. Formally, a RE is one of the following (along with the set of strings it designates):
| name | RE | designation |
| epsilon |
|
{``''} |
| symbol | a | {``a''} for some character a |
| concatenation | AB | the set
{ rs| r |
| alternation | A| B | the set
|
| repetition or Kleen closure | A* | the set
|
For example, (a| b)c designates the set
{ rs| r
({``a''}
{``b''}), s
{``c''} },
which is equal to
{``ac'',``bc''}.
We will use the following notational conveniences:
| P+ | = | PP* |
| P? | = | P | |
| [a - z] | = | a| b| c| ... | z |
We can freely put parentheses around REs to denote the order of evaluation. For example, (a| b)c. To avoid using many parentheses, we use the following rules: concatenation and alternation are associative (ie, ABC means (AB)C and is equivalent to A(BC)), alternation is commutative (ie, A| B = B| A), repetition is idempotent (ie, A** = A*), and concatenation distributes over alternation (eg, (a| b)c = ac| bc).
For convenience, we can give names to REs so we can refer to them by their name. For example:
| for |
= | for |
| letter | = | [a - zA - Z] |
| digit | = | [0 - 9] |
| identifier | = | letter (letter | digit)* |
| sign | = | + | - | |
| integer | = | sign (0 | [1 - 9]digit*) |
| decimal | = | integer . digit* |
| real | = | (integer | decimal ) E sign digit+ |
There is some ambiguity though: If the input includes the characters
for8, then the first rule (for for_keyword) matches 3
characters (for), the fourth rule (for identifier) can
match 1, 2, 3, or 4 characters, the longest being for8. To
resolve this type of ambiguities, when there is a choice of rules,
scanners choose the one that matches the maximum number of
characters. In this case, the chosen rule is the one for identifier that matches 4 characters. This disambiguation rule is
called the longest match rule. If there more than one rules that
match the same maximum number of characters, the rule listed first is
chosen. This is the rule priority disambiguation rule. For
example, the lexical word for is taken as a for_keyword
even though it uses the same number of characters as an identifier.