next up previous contents
Next: 6.2 Types and Type Up: 6 Semantic Analysis Previous: 6 Semantic Analysis   Contents

6.1 Symbol Tables

After ASTs have been constructed, the compiler must check whether the input program is type-correct. This is called type checking and is part of the semantic analysis. During type checking, a compiler checks whether the use of names (such as variables, functions, type names) is consistent with their definition in the program. For example, if a variable x has been defined to be of type int, then x+1 is correct since it adds two integers while x[1] is wrong. When the type checker detects an inconsistency, it reports an error (such as ``Error: an integer was expected"). Another example of an inconsistency is calling a function with fewer or more parameters or passing parameters of the wrong type.

Consequently, it is necessary to remember declarations so that we can detect inconsistencies and misuses during type checking. This is the task of a symbol table. Note that a symbol table is a compile-time data structure. It's not used during run time by statically typed languages. Formally, a symbol table maps names into declarations (called attributes), such as mapping the variable name x to its type int. More specifically, a symbol table stores:

One convenient data structure for symbol tables is a hash table. One organization of a hash table that resolves conflicts is chaining. The idea is that we have list elements of type:

class Symbol {
    public String key;
    public Object binding; 
    public Symbol next;
    public Symbol ( String k, Object v, Symbol r ) { key=k; binding=v; next=r; }
to link key-binding mappings. Here a binding is any Object, but it can be more specific (eg, a Type class to represent a type or a Declaration class, as we will see below). The hash table is a vector Symbol[] of size SIZE, where SIZE is a prime number large enough to have good performance for medium size programs (eg, SIZE=109). The hash function must map any key (ie. any string) into a bucket (ie. into a number between 0 and SIZE-1). A typical hash function is, h(key) = num(key) mod SIZE, where num converts a key to an integer and mod is the modulo operator. In the beginning, every bucket is null. When we insert a new mapping (which is a pair of key-binding), we calculate the bucket location by applying the hash function over the key, we insert the key-binding pair into a Symbol object, and we insert the object at the beginning of the bucket list. When we want to find the binding of a name, we calculate the bucket location using the hash function, and we search the element list of the bucket from the beginning to its end until we find the first element with the requested name.

Consider the following Java program:

1) { 
2)   int a;
3)   {
4)     int a;
5)     a = 1;
6)   };
7)   a = 2;
8) };
The statement a = 1 refers to the second integer a, while the statement a = 2 refers to the first. This is called a nested scope in programming languages. We need to modify the symbol table to handle structures and we need to implement the following operations for a symbol table:

We have already seen insert and lookup. When we have a new block (ie, when we encounter the token {), we begin a new scope. When we exit a block (ie. when we encounter the token }) we remove the scope (this is the end_scope). When we remove a scope, we remove all declarations inside this scope. So basically, scopes behave like stacks. One way to implement these functions is to use a stack of numbers (from 0 to SIZE) that refer to bucket numbers. When we begin a new scope we push a special marker to the stack (eg, -1). When we insert a new declaration in the hash table using insert, we also push the bucket number to the stack. When we end a scope, we pop the stack until and including the first -1 marker. For each bucket number we pop out from the stack, we remove the head of the binding list of the indicated bucket number. For example, for the previous C program, we have the following sequence of commands for each line in the source program (we assume that the hash key for a is 12):

1) push(-1)
2) insert the binding from a to int into the beginning of the list table[12]
3) push(-1)
4) insert the binding from a to int into the beginning of the list table[12]
6) pop()
   remove the head of table[12]
7) pop()
   remove the head of table[12]
Recall that when we search for a declaration using lookup, we search the bucket list from the beginning to the end, so that if we have multiple declarations with the same name, the declaration in the innermost scope overrides the declaration in the outer scope.

The textbook makes an improvement to the above data structure for symbol tables by storing all keys (strings) into another data structure and by using pointers instead of strings for keys in the hash table. This new data structure implements a set of strings (ie. no string appears more than once). This data structure too can be implemented using a hash table. The symbol table itself uses the physical address of a string in memory as the hash key to locate the bucket of the binding. Also the key component of element is a pointer to the string. The idea is that not only we save space, since we store a name once regardless of the number of its occurrences in a program, but we can also calculate the bucket location very fast.

next up previous contents
Next: 6.2 Types and Type Up: 6 Semantic Analysis Previous: 6 Semantic Analysis   Contents