A compiler can be viewed as a program that accepts a source code (such as a Java program) and generates machine code for some computer architecture. Suppose that you want to build compilers for n programming languages (eg, FORTRAN, C, C++, Java, BASIC, etc) and you want these compilers to run on m different architectures (eg,, MIPS, SPARC, Intel, alpha, etc). If you do that naively, you need to write n*m compilers, one for each language-architecture combination.
The holly grail of portability in compilers is to do the same thing by writing n + m programs only. How? You use a universal Intermediate Representation (IR) and you make the compiler a two-phase compiler. An IR is typically a tree-like data structure that captures the basic features of most computer architectures. One example of an IR tree node is a representation of a 3-address instruction, such as d s1 + s2, that gets two source addresses, s1 and s2, (ie. two IR trees) and produces one destination address, d. The first phase of this compilation scheme, called the front-end, maps the source code into IR, and the second phase, called the back-end, maps IR into machine code. That way, for each programming language that you want to compile, you write one front-end only, and for each computer architecture, you write one back-end. So totally you have n + m components.
But the above ideal separation of compilation into two phases does not work very well for real programming languages and architectures. Ideally, you must encode all knowledge about the source programming language in the front end, you must handle all machine architecture features in the back end, and you must design your IRs in such a way that all language and machine features are captured properly.
A typical real-world compiler usually has multiple phases. This increases the compiler's portability and simplifies retargeting. The front end consists of the following phases: