A scanner groups input characters into tokens. For example, if the input is x = x*(b+1); then the scanner generates the following sequence of tokens: id(x) = id(x) * ( id(b) + num(1) ) ; where id(x) indicates the identifier with name x (a program variable in this case) and num(1) indicates the integer 1. Each time the parser needs a token, it sends a request to the scanner. The scanner then reads as many characters from the input stream as are necessary to construct a single token. The scanner may report an error during scanning (e.g., when it finds an end-of-file before a string terminates). Otherwise, once a single token is formed, the scanner is suspended and returns the token to the parser. The parser repeatedly calls the scanner until it has read all the tokens from the input stream or until an error is detected (such as a syntax error).
Tokens are typically represented by numbers. For example, the token * may be the token with number 35. Some tokens require extra information. For example, an identifier is a token (so it is represented by some number), but it is also associated with a string, the identifier name. The token id(x), for instance, is associated with the string "x". Similarly, the token num(1) is associated with the number 1.
Tokens are specified using patterns, called regular expressions. For example, the regular expression [a-z][a-zA-Z0-9]* recognizes all identifiers with at least one alphanumeric letter whose first letter is lower-case alphabetic.
A typical scanner recognizes the keywords of the language, special characters (such as ( and )) or groups of special characters, identifiers, and constants; it ignores whitespace and comments; and it recognizes and processes special directives (such as the #include "file" directive in C) and macros.
A key issue is speed. One can always write a dumb scanner that groups the input characters into lexical words (a lexical word can be either a sequence of alphanumeric characters without whitespace or special characters, or just one special character), and then tries to recognize what token this lexical word corresponds to (i.e., number, keyword, identifier, etc.) by performing a number of string comparisons. This becomes very expensive when there are many keywords and/or many special lexical patterns in the language. In this section you will learn how to build efficient scanners using regular expressions and finite automata. There are automated tools called scanner generators, such as lex and flex, in which you specify your scanner using regular expressions and these tools construct a fast scanner for you according to your specification. You will first learn how to specify a scanner using regular expressions, then the underlying theory that scanner generators use to compile regular expressions into efficient programs (which are basically finite state machines), and then you will learn how to use a popular scanner generator, called flex.