next up previous contents
Next: 2. Lexical Analysis Up: 1. Introduction Previous: 1.2 What is the   Contents

1.3 Compiler Architecture

A compiler can be viewed as a program that accepts a source code (such as a Java program) and generates machine code for some computer architecture. Suppose that you want to build compilers for n programming languages (eg, FORTRAN, C, C++, Java, BASIC, etc) and you want these compilers to run on m different architectures (eg,, MIPS, SPARC, Intel, alpha, etc). If you do that naively, you need to write n*m compilers, one for each language-architecture combination.

The holly grail of portability in compilers is to do the same thing by writing n+m programs only. How? You use a universal Intermediate Representation (IR) and you make the compiler a two-phase compiler. An IR is typically a tree-like data structure that captures the basic features of most computer architectures. One example of an IR tree node is a representation of a 3-address instruction, such as $d\leftarrow s_1+s_2$, that gets two source addresses, s1 and s2, (ie. two IR trees) and produces one destination address, d. The first phase of this compilation scheme, called the front-end, maps the source code into IR, and the second phase, called the back-end, maps IR into machine code. That way, for each programming language that you want to compile, you write one front-end only, and for each computer architecture, you write one back-end. So totally you have n+m components.

But the above ideal separation of compilation into two phases does not work very well for real programming languages and architectures. Ideally, you must encode all knowledge about the source programming language in the front end, you must handle all machine architecture features in the back end, and you must design your IRs in such a way that all language and machine features are captured properly.

A typical real-world compiler usually has multiple phases. This increases the compiler's portability and simplifies retargeting. The front end consists of the following phases:

1.
scanning: a scanner maps input characters into tokens
2.
parsing: a parser recognizes sequences of tokens according to some grammar and generates Abstract Syntax Trees (ASTs).
3.
Semantic analysis: this phase performs type checking (ie, checking whether the variables, functions etc in the source program are used consistently with their definitions and with the language semantics) and translates ASTs into IRs.
4.
optimization: to optimize IRs
The back end consists of the following phases:
1.
instruction selection: to map IRs into assembly code
2.
code optimization: to optimize the assembly code using control- and data flow analysis, register allocation, etc.
3.
code emission: to generate machine code from assembly code.


next up previous contents
Next: 2. Lexical Analysis Up: 1. Introduction Previous: 1.2 What is the   Contents
Leonidas Fegaras
2000-12-27