A scanner groups input characters into tokens. For example, if the input is
x = x*(b+1);then the scanner generates the following sequence of tokens
id(x) = id(x) * ( id(b) + num(1) ) ;where
id(x) indicates the identifier with name x (a
program variable in this case) and num(1) indicates the integer
1. Each time the parser needs a token, it sends a request to
the scanner. Then, the scanner reads as many characters from the input
stream as it is necessary to construct a single token. The scanner
may report an error during scanning (eg, when it finds an end-of-file
in the middle of a string). Otherwise, when a single token is formed,
the scanner is suspended and returns the token to the parser. The
parser will repeatedly call the scanner to read all the tokens from
the input stream or until an error is detected (such as a syntax
error).
Tokens are typically represented by numbers. For example, the token
* may be assigned number 35. Some tokens require some extra
information. For example, an identifier is a token (so it is
represented by some number) but it is also associated with a string
that holds the identifier name. For example, the token id(x) is
associated with the string, "x". Similarly, the token
num(1) is associated with the number, 1.
Tokens are specified by patterns, called regular expressions.
For example, the regular expression [a-z][a-zA-Z0-9]*
recognizes all identifiers with at least one alphanumeric letter whose
first letter is lower-case alphabetic.
A typical scanner:
class in Java);
( and ), or
groups of special characters, such as := and ==;
#include "file"
directive in C) and macros.
A key issue is speed. One can always write a naive scanner that groups the input characters into lexical words (a lexical word can be either a sequence of alphanumeric characters without whitespaces or special characters, or just one special character), and then tries to associate a token (ie. number, keyword, identifier, etc) to this lexical word by performing a number of string comparisons. This becomes very expensive when there are many keywords and/or many special lexical patterns in the language. In this section you will learn how to build efficient scanners using regular expressions and finite automata. There are automated tools called scanner generators, such as flex for C and JLex for Java, which construct a fast scanner automatically according to specifications (regular expressions). You will first learn how to specify a scanner using regular expressions, then the underlying theory that scanner generators use to compile regular expressions into efficient programs (which are basically finite state machines), and then you will learn how to use a scanner generator for Java, called JLex.