
1.3 Compiler Architecture

A compiler can be viewed as a program that accepts source code (such as a Java program) and generates machine code for some computer architecture. Suppose that you want to build compilers for n programming languages (e.g., Scala, C, C++, Java) and you want these compilers to run on m different architectures (e.g., MIPS, x86, ARM). Done naively, this requires writing n*m compilers, one for each language-architecture combination.

The holy grail of portability in compilers is to achieve the same coverage by writing only n + m programs. How? You use a universal Intermediate Representation (IR) and make the compiler a two-phase compiler. An IR is typically a tree-like data structure that captures the basic features of most computer architectures. One example of an IR tree node is a representation of a 3-address instruction, such as d ← s1 + s2, that takes two source addresses, s1 and s2 (i.e., two IR trees), and produces one destination address, d. The first phase of this compilation scheme, called the front-end, maps the source code into IR, and the second phase, called the back-end, maps IR into machine code. That way, for each programming language you want to compile, you write only one front-end, and for each target computer architecture, you write only one back-end. In total, you write n + m components.
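To make this concrete, here is a minimal sketch of such IR tree nodes in Scala; the names IRExp, Temp, BinOp, and Move are illustrative assumptions, not the IR of any particular compiler:

    // Minimal sketch of IR tree nodes (illustrative names only).
    sealed trait IRExp
    case class Temp(name: String) extends IRExp                        // an abstract address/register
    case class Const(value: Int) extends IRExp                         // an integer constant
    case class BinOp(op: String, left: IRExp, right: IRExp) extends IRExp

    sealed trait IRStmt
    case class Move(dst: Temp, src: IRExp) extends IRStmt              // store the value of src into dst

    // The 3-address instruction d <- s1 + s2 as an IR tree:
    // the two source addresses are subtrees of the BinOp node.
    val instr: IRStmt = Move(Temp("d"), BinOp("+", Temp("s1"), Temp("s2")))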

But this ideal separation of compilation into two phases does not work very well in practice for real programming languages and architectures. Ideally, all knowledge about the source programming language should be encoded in the front end, all machine architecture features should be handled in the back end, and the IR should be designed so that all language and machine features are captured properly.

A typical real-world compiler therefore has multiple phases, which increases its portability and simplifies retargeting. The front end consists of the following phases:

  1. scanning: a scanner groups input characters into tokens;
  2. parsing: a parser recognizes sequences of tokens according to some grammar and generates Abstract Syntax Trees (ASTs), as in the sketch after this list;
  3. semantic analysis: a semantic analyzer performs type checking (i.e., checking whether the variables, functions, etc. in the source program are used consistently with their definitions and with the language semantics) and translates ASTs into IRs;
  4. optimization: optimizes IRs.
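As a rough illustration of what the first two front-end phases produce, the following Scala sketch shows tokens and AST nodes for a tiny expression language; the token and node names are hypothetical:

    // Hypothetical token and AST definitions for a tiny expression language.
    sealed trait Token
    case class NUM(value: Int) extends Token
    case class ID(name: String) extends Token
    case object PLUS extends Token
    case object TIMES extends Token

    sealed trait AST
    case class Num(value: Int) extends AST
    case class Var(name: String) extends AST
    case class Add(left: AST, right: AST) extends AST
    case class Mul(left: AST, right: AST) extends AST

    // The scanner turns the characters of "x + 2 * y" into tokens ...
    val tokens: List[Token] = List(ID("x"), PLUS, NUM(2), TIMES, ID("y"))
    // ... and the parser turns the tokens into an AST that reflects
    // the grammar (here, * binds tighter than +).
    val ast: AST = Add(Var("x"), Mul(Num(2), Var("y")))

The semantic analyzer would then type-check this AST and translate it into IR trees such as the Move/BinOp nodes sketched earlier.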
The back end consists of the following phases:
  1. instruction selection: maps IRs into assembly code, as in the sketch after this list;
  2. code optimization: optimizes the assembly code using control-flow and data-flow analyses, register allocation, etc;
  3. code emission: generates machine code from assembly code.
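As a sketch of instruction selection, the following hypothetical function tiles the IR nodes from the earlier sketch with MIPS-like instructions; real instruction selectors cover far more patterns (e.g., via maximal-munch tree tiling):

    // Hypothetical instruction selection over the IR sketched earlier:
    // each IR pattern is tiled with an assembly instruction.
    def select(stmt: IRStmt): List[String] = stmt match {
      case Move(Temp(d), BinOp("+", Temp(s1), Temp(s2))) =>
        List(s"add $d, $s1, $s2")            // d <- s1 + s2
      case Move(Temp(d), Const(n)) =>
        List(s"li $d, $n")                   // load an immediate constant
      case Move(Temp(d), Temp(s)) =>
        List(s"move $d, $s")                 // register-to-register copy
      case _ =>
        sys.error("IR pattern not covered by this sketch")
    }

    // Example: the IR tree for d <- s1 + s2 maps to one instruction.
    select(Move(Temp("d"), BinOp("+", Temp("s1"), Temp("s2"))))   // List("add d, s1, s2")

Code optimization would then rewrite this assembly (e.g., allocating machine registers for the temporaries), and code emission would encode it as machine code.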
The generated machine code is written to an object file. This file is not executable, since it may refer to external symbols (such as system calls). The operating system provides the following utilities to execute the code:
  1. linking: a linker takes several object files and libraries as input and produces one executable object file. It retrieves the code of all referenced functions/procedures from the input files, combines it into the executable object file, and resolves all external references to real addresses. The libraries include the operating system libraries, the language-specific libraries, and, possibly, user-created libraries (a toy sketch of symbol resolution appears below);
  2. loading: a loader loads an executable object file into memory, initializes the registers, heap, data, etc., and starts the execution of the program.
Relocatable shared libraries allow efficient memory use when many different applications share the same code.
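To illustrate the symbol-resolution part of linking, here is a toy Scala model; it ignores relocation and real addresses, and the ObjectFile type and link function are assumptions made for this sketch only:

    // Toy model of linking: each object file defines some symbols and
    // refers to external ones; linking merges the definitions and checks
    // that every external reference is defined somewhere.
    case class ObjectFile(name: String,
                          defined: Map[String, String],   // symbol -> code (stand-in for machine code)
                          external: Set[String])          // unresolved external references

    def link(inputs: List[ObjectFile]): Either[String, Map[String, String]] = {
      val symbolTable = inputs.flatMap(_.defined).toMap
      val unresolved  = inputs.flatMap(_.external).filterNot(symbolTable.contains)
      if (unresolved.isEmpty) Right(symbolTable)          // every reference resolved
      else Left("undefined symbols: " + unresolved.mkString(", "))
    }

    // Example: main.o calls printf, which a library object file defines.
    val mainObj = ObjectFile("main.o", Map("main" -> "<code>"), Set("printf"))
    val libObj  = ObjectFile("libc.o", Map("printf" -> "<code>"), Set.empty)
    val linked  = link(List(mainObj, libObj))             // Right(merged symbol table)

A real linker additionally performs relocation, rewriting the code so that resolved symbols refer to actual memory addresses.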

