next up previous contents
Next: 2.2 Deterministic Finite Automata Up: 2 Lexical Analysis Previous: 2 Lexical Analysis

2.1 Regular Expressions (REs)

Regular expressions are a very convenient form of representing (possibly infinite) sets of strings, called regular sets. For example, the RE (a| b)*aa represents the infinite set {``aa'',``aaa'',``baa'',``abaa'', ... }, which is the set of all strings with characters a and b that end in aa. Formally, a RE is one of the following (along with the set of strings it designates):

name RE designation
epsilon $\varepsilon$ {``''}
symbol a {``a''} for some character a
concatenation AB the set rsr $in$ $ \cal {A}$s $in$ $ \cal {B}$ }, where rs is string concatenation, and $ \cal {A}$ and $ \cal {B}$ are the designations of of the REs A and B
alternation A| B the set $ \cal {A}$ $ \cup$ $ \cal {B}$, where $ \cal {A}$ and $ \cal {B}$ are the designations of the REs A and B
repetition or Kleen closure A* the set $ \epsilon$| A|(AA)|(AAA)| ... (an infinite set)

For example, (a| b)c designates the set rsr $in$ ({``a''} $ \cup$ {``b''}), s $in$ {``c''} }, which is equal to {``ac'',``bc''}.

We will use the following notational conveniences:

P+ = PP*
P? = P | $\displaystyle \varepsilon$
[a - z] = a| b| c| ... | z

We can freely put parentheses around REs to denote the order of evaluation. For example, (a| b)c. To avoid using many parentheses, we use the following rules: concatenation and alternation are associative (ie, ABC means (AB)C and is equivalent to A(BC)), alternation is commutative (ie, A| B = B| A), repetition is idempotent (ie, A** = A*), and concatenation distributes over alternation (eg, (a| b)c = ac| bc).

For convenience, we can give names to REs so we can refer to them by their name. For example:

for$\displaystyle \_keyword$ = for
letter = [a - zA - Z]
digit = [0 - 9]
identifier = letter (letter | digit)*
sign = + | - | $\displaystyle \varepsilon$
integer = sign (0 | [1 - 9]digit*)
decimal = integer . digit*
real = (integer | decimalE sign digit+

There is some ambiguity though: If the input includes the characters for8, then the first rule (for for_keyword) matches 3 characters (for), the fourth rule (for identifier) can match 1, 2, 3, or 4 characters, the longest being for8. To resolve this type of ambiguities, when there is a choice of rules, scanners choose the one that matches the maximum number of characters. In this case, the chosen rule is the one for identifier that matches 4 characters. This disambiguation rule is called the longest match rule. If there more than one rules that match the same maximum number of characters, the rule listed first is chosen. This is the rule priority disambiguation rule. For example, the lexical word for is taken as a for_keyword even though it uses the same number of characters as an identifier.


next up previous contents
Next: 2.2 Deterministic Finite Automata Up: 2 Lexical Analysis Previous: 2 Lexical Analysis
Leonidas Fegaras
2002-08-26