next up previous contents
Next: 2.2 Deterministic Finite Automata Up: 2 Lexical Analysis Previous: 2 Lexical Analysis   Contents

2.1 Regular Expressions (REs)

Regular expressions are a very convenient form of representing (possibly infinite) sets of strings, called regular sets. For example, the RE (a| b)*aa represents the infinite set {``aa",``aaa",``baa",``abaa", ... }, which is the set of all strings with characters a and b that end in aa. Formally, a RE is one of the following (along with the set of strings it designates):

name RE designation
epsilon $ \varepsilon$ {``"}
symbol a {``a"} for some character a
concatenation AB the set rsr $ \in$ $ \cal {A}$s $ \in$ $ \cal {B}$ }, where rs is string
    concatenation, and $ \cal {A}$ and $ \cal {B}$ designate the REs A and B
alternation A| B the set $ \cal {A}$ $ \cup$ $ \cal {B}$, where $ \cal {A}$ and $ \cal {B}$ designate the REs A and B
repetition A* the set $ \epsilon$| A|(AA)|(AAA)| ... (an infinite set)

Repetition is also called Kleen closure. For example, (a| b)c designates the set rsr $ \in$ ({``a"} $ \cup$ {``b"}), s $ \in$ {``c"} }, which is equal to {``ac",``bc"}.

We will use the following notational conveniences:

P+ = PP*
P? = P | $\displaystyle \varepsilon$
[a - z] = a| b| c| ... | z

We can freely put parentheses around REs to denote the order of evaluation. For example, (a| b)c. To avoid using many parentheses, we use the following rules: concatenation and alternation are associative (ie, ABC means (AB)C and is equivalent to A(BC)), alternation is commutative (ie, A| B = B| A), repetition is idempotent (ie, A** = A*), and concatenation distributes over alternation (eg, (a| b)c = ac| bc).

For convenience, we can give names to REs so we can refer to them by their name. For example:

for - keyword = for
letter = [a - zA - Z]
digit = [0 - 9]
identifier = letter (letter | digit)*
sign = + | - | $\displaystyle \varepsilon$
integer = sign (0 | [1 - 9]digit*)
decimal = integer . digit*
real = (integer | decimalE sign digit+

There is some ambiguity though: If the input includes the characters for8, then the first rule (for for-keyword) matches 3 characters (for), the fourth rule (for identifier) can match 1, 2, 3, or 4 characters, the longest being for8. To resolve this type of ambiguities, when there is a choice of rules, scanner generators choose the one that matches the maximum number of characters. In this case, the chosen rule is the one for identifier that matches 4 characters (for8). This disambiguation rule is called the longest match rule. If there are more than one rules that match the same maximum number of characters, the rule listed first is chosen. This is the rule priority disambiguation rule. For example, the lexical word for is taken as a for-keyword even though it uses the same number of characters as an identifier.


next up previous contents
Next: 2.2 Deterministic Finite Automata Up: 2 Lexical Analysis Previous: 2 Lexical Analysis   Contents
fegaras 2012-01-10