Regular expressions are a very convenient form of representing (possibly infinite) sets of strings, called regular sets. For example, the RE (a| b)*aa represents the infinite set {``aa",``aaa",``baa",``abaa", ... }, which is the set of all strings with characters a and b that end in aa. Formally, a RE is one of the following (along with the set of strings it designates):
| name | RE | designation |
| epsilon |
|
{``"} |
| symbol | a | {``a"} for some character a |
| concatenation | AB | the set
{ rs| r |
| concatenation, and |
||
| alternation | A| B | the set
|
| repetition | A* | the set
|
Repetition is also called Kleen closure.
For example, (a| b)c designates the set
{ rs| r
({``a"}
{``b"}), s
{``c"} },
which is equal to
{``ac",``bc"}.
We will use the following notational conveniences:
| P+ | = | PP* |
| P? | = | P | |
| [a - z] | = | a| b| c| ... | z |
We can freely put parentheses around REs to denote the order of evaluation. For example, (a| b)c. To avoid using many parentheses, we use the following rules: concatenation and alternation are associative (ie, ABC means (AB)C and is equivalent to A(BC)), alternation is commutative (ie, A| B = B| A), repetition is idempotent (ie, A** = A*), and concatenation distributes over alternation (eg, (a| b)c = ac| bc).
For convenience, we can give names to REs so we can refer to them by their name. For example:
| for - keyword | = | for |
| letter | = | [a - zA - Z] |
| digit | = | [0 - 9] |
| identifier | = | letter (letter | digit)* |
| sign | = | + | - | |
| integer | = | sign (0 | [1 - 9]digit*) |
| decimal | = | integer . digit* |
| real | = | (integer | decimal ) E sign digit+ |
There is some ambiguity though: If the input includes the characters
for8, then the first rule (for for-keyword) matches 3
characters (for), the fourth rule (for identifier) can
match 1, 2, 3, or 4 characters, the longest being for8. To
resolve this type of ambiguities, when there is a choice of rules,
scanner generators choose the one that matches the maximum number of
characters. In this case, the chosen rule is the one for identifier that matches 4 characters (for8). This
disambiguation rule is called the longest match rule. If there
are more than one rules that match the same maximum number of
characters, the rule listed first is chosen. This is the rule
priority disambiguation rule. For example, the lexical word
for is taken as a for-keyword even though it uses the
same number of characters as an identifier.