next up previous contents
Next: 2.1 Regular Expressions (REs) Up: CSE 5317/4305: Design and Previous: 1.3 Compiler Architecture

2 Lexical Analysis

A scanner groups input characters into tokens. For example, if the input is

x = x*(b+1);
then the scanner generates the following sequence of tokens:

id(x) = id(x) * ( id(b) + num(1) ) ;

where id(x) indicates the identifier with name x (a program variable in this case) and num(1) indicates the integer 1. Each time the parser needs a token, it sends a request to the scanner. The scanner then reads as many characters from the input stream as necessary to construct a single token. The scanner may report an error during scanning (e.g., when it finds an end-of-file before a string terminates). Otherwise, once a single token is formed, the scanner is suspended and returns the token to the parser. The parser calls the scanner repeatedly until it has read all the tokens from the input stream or until an error is detected (such as a syntax error).

Tokens are typically represented by numbers. For example, the token * may be the token with number 35. Some tokens require extra information. For example, an identifier is a token (so it is represented by some number) but it is also associated with a string, the identifier name. For example, the token id(x) is associated with the string "x". Similarly, the token num(1) is associated with the number 1.

Tokens are specified using patterns called regular expressions. For example, the regular expression [a-z][a-zA-Z0-9]* recognizes all identifiers that consist of alphanumeric characters and whose first character is a lower-case letter.

A typical scanner:

  1. recognizes the keywords of the language (these are the reserved words that have a special meaning in the language, such as the word begin in Pascal);
  2. recognizes special characters, such as ( and ), or groups of special characters, such as := and ==;
  3. recognizes identifiers, integers, reals, decimals, strings, etc.;
  4. ignores whitespace (tabs and blanks) and comments;
  5. recognizes and processes special directives (such as the #include "file" directive in C) and macros.

A key issue is speed. One can always write a naive scanner that groups the input characters into lexical words (a lexical word is either a sequence of alphanumeric characters without whitespace or special characters, or a single special character) and then tries to recognize which token each lexical word corresponds to (i.e., number, keyword, identifier, etc.) by performing a number of string comparisons. This becomes very expensive when there are many keywords and/or many special lexical patterns in the language. In this section you will learn how to build efficient scanners using regular expressions and finite automata. There are automated tools called scanner generators, such as lex and flex, in which you specify your scanner using regular expressions, and these tools construct a fast scanner for you according to your specification. You will first learn how to specify a scanner using regular expressions, then the underlying theory that scanner generators use to compile regular expressions into efficient programs (which are basically finite state machines), and finally how to use a popular scanner generator, called flex.

Leonidas Fegaras