After ASTs have been constructed, the compiler must check whether the
input program is type-correct. This is called type checking and
is part of the semantic analysis. During type checking, a compiler
checks whether the use of names (such as variables, functions, type
names) is consistent with their definition in the program. For
example, if a variable x has been defined to be of type
int, then x+1 is correct since it adds two integers while
x[1] is wrong. When the type checker detects an inconsistency,
it reports an error (such as ``Error: an integer was expected").
Another example of an inconsistency is calling a function with fewer or
more parameters or passing parameters of the wrong type.
Consequently, it is necessary to remember declarations so that we can
detect inconsistencies and misuses during type checking. This is the
task of a symbol table. Note that a symbol table is a
compile-time data structure. It's not used during run time by statically typed
languages. Formally, a symbol table maps names into
declarations (called attributes), such as mapping the
variable name x to its type int. More specifically,
a symbol table stores:
typedef int* mytype, it maps the name mytype to a data structure
that represents the type int*).
One convenient data structure for symbol tables is a hash table. One organization of a hash table that resolves conflicts is chaining. The idea is that we have list elements of type:
class Symbol {
public String key;
public Object binding;
public Symbol next;
public Symbol ( String k, Object v, Symbol r ) { key=k; binding=v; next=r; }
}
to link key-binding mappings.
Here a binding is any Object, but it can be more specific (eg, a Type
class to represent a type or a Declaration class, as we will see below). The hash table is a vector
Symbol[] of size SIZE, where SIZE is a prime
number large enough to have good performance for medium size programs
(eg, SIZE=109). The hash function must map any key (ie. any
string) into a bucket (ie. into a number between 0 and SIZE-1).
A typical hash function is, h(key) = num(key) mod SIZE,
where num converts a key to an integer and mod is the modulo operator.
In the beginning, every bucket is null. When we insert a new mapping
(which is a pair of key-binding), we calculate the bucket location by
applying the hash function over the key, we insert the key-binding
pair into a Symbol object, and we insert the object at
the beginning of the bucket list. When we want to find the binding of
a name, we calculate the bucket location using the hash function, and
we search the element list of the bucket from the beginning to its end
until we find the first element with the requested name.
Consider the following Java program:
1) {
2) int a;
3) {
4) int a;
5) a = 1;
6) };
7) a = 2;
8) };
The statement a = 1 refers to the second integer a, while
the statement a = 2 refers to the first. This is called
a nested scope in programming languages. We need to modify the
symbol table to handle structures and we need to implement the following operations
for a symbol table:
insert ( String key, Object binding )
Object lookup ( String key )
begin_scope ()
end_scope ()
We have already seen insert and lookup. When we have a new block
(ie, when we encounter the token {), we begin a new
scope. When we exit a block (ie. when we encounter the token
}) we remove the scope (this is the end_scope). When we
remove a scope, we remove all declarations inside this scope. So
basically, scopes behave like stacks. One way to implement these
functions is to use a stack of numbers (from 0 to SIZE) that
refer to bucket numbers. When we begin a new scope we push a special
marker to the stack (eg, -1). When we insert a new declaration in
the hash table using insert, we also push the bucket number to
the stack. When we end a scope, we pop the stack until and including
the first -1 marker. For each bucket number we pop out from the stack,
we remove the head of the binding list of the indicated bucket number.
For example, for the previous C program, we have the following
sequence of commands for each line in the source program (we assume that
the hash key for a is 12):
1) push(-1) 2) insert the binding from a to int into the beginning of the list table[12] push(12) 3) push(-1) 4) insert the binding from a to int into the beginning of the list table[12] push(12) 6) pop() remove the head of table[12] pop() 7) pop() remove the head of table[12] pop()Recall that when we search for a declaration using lookup, we search the bucket list from the beginning to the end, so that if we have multiple declarations with the same name, the declaration in the innermost scope overrides the declaration in the outer scope.
The textbook makes an improvement to the above data structure for
symbol tables by storing all keys (strings) into another data
structure and by using pointers instead of strings for keys in the
hash table. This new data structure implements a set of strings
(ie. no string appears more than once). This data structure too can be
implemented using a hash table. The symbol table itself uses the
physical address of a string in memory as the hash key to locate the
bucket of the binding. Also the key component of element is a
pointer to the string. The idea is that not only we save space, since
we store a name once regardless of the number of its occurrences in a program,
but we can also calculate the bucket location very fast.