...: Compiler Instruction Selection by Gabriel H Blindell

Collection of Terms

A compiler performs three tasks -
- Syntactic analysis [Frontend]
- Program optimization
- Code generation [Backend]
-- Instrument selection - which instruction to use
-- Instrument scheduling - scheduling the selected instructions
-- Register allocation - assigning registers to variables of the program

The compiler frontend must interpret inputs in only one way. The backend can generate variety of correct outputs.

RISC - reduced instruction set computer, is a type of architectures e.g. MIPS.

Single-output Instructions - those that produce a single observable output value e.g. arithmetic instructions such as add, sub or bit operations. It also includes insturctions that have a chain of computations to produce a single output e.g. load address from register a, add offset from register b. Almost all RISC instructions fall in this category.

Multi-output Instructions - those that produce multiple observable output values from the same input values e.g. divmod instruction that produces quotient and remainder or arithmetic instructions that produce an output and also set status flags e.g. arithmetic operation and overflow status.

Disjoint-output Instructions - those that produce multiple output values from multiple input values e.g. SIMD instructions which execute same operations simultaneously on multiple input values, x86 has AVX and SSE instructions for vector processing.

SIMD - single instruction multiple data instructions

Inter-block Instructions - whose behavior spans multiple blocks in the IR code. E.g. saturated arithmetic instructions i.e. where the output of the arithmetic operation is clamped, say permitted output range is -100 to +100, the saturated sum of 40 + 80 is 100. Customized architectures like ARM Cortex M-7 processor support these instructions, and the compiler provides intrinsics for these insturctions. In the IR code, these intrinsics are represented as special nodes which are then converted into special instructions.

Interdependent Instruction - which exhibit additional constraints when combined with other instructions e.g. on the TMS320C55x processor, the add instruction cannot be combined with rpt instruction if a particular addressing mode is used for the add instruction.

Instruction Set Architecture - ISA is the set of available instructions which target machine supports.

Assembly Language - is mnemonic names for machine code i.e. assembly instruction add represents a series of 0 and 1 in machine code. Assembler converts assembly to machine code.

Programming languages are high level instructions, where multiple operations are implicit in each statement and it is possible that a high level statement may not have a direct translation in assembly (i.e. it has to be emulated by multiple instructions). Thus a compiler converts single high level statement into one or more assembly instructions.

Intermediate Representation - IR code is produced by the compiler parsing and converting high level code into a representation which is independent of the programming language i.e. multiple programming languages can have their compilers convert them into the same IR code. The IR code can be subject to generic optimizations like dead code removal, constant folding, loop unrolling, etc. Finally IR code is provided to the backend to convert to target assembly code.

Optimal Instruction Selection

A selection of instruction is optimal if an instruction selector can identify a subset of ISA which implements the input at the least cost. Instruction selectors are not always capable of this if -
- they do not suport all instructions in the ISA
- the form of input may make the job of the selector easy or complicated
- after instruction scheduling and register allocation, the instructions selected may not be optimal

Compiler authors need to balance the coupling of instruction selection algorithms and target architecture i.e. a selector tightly coupled with an architecture will produce optimal code but will not be useful as a generic compiler implementation for other architectures and similarly a generic selector will not produce optimal code for any other architecture.

Further, instruction selection, scheduling and register allocation are three consecutive stages, but they are tightly coupled i.e. based on target architecture, the registers available, the register allocation stage may make the selected instructions sub-optimal OR the target architecture may support running instructions in parallel and thus the scheduler is capable of further optimizing the selectors output. Thus these three tasks are best done in unison.

Macro Expansion

Instruction selectors match the program code with templates and on a match execute a corresponding macro. The macro execution is done by the macro expander, it uses the matched program string as an argument. the macro execution is what emits the assembly code corresponding to the ISA of the target architecture.

For simplicity assume each line of program code is matched with a regular expression (template) and if the match succeeds a corresponding function (macro) converts the program code to assembly.

If a program code does not match with any template, the instruction selector raises an error. This situation should never occur. Macro expansion is separate from template matching, thus allowing compiler writers to easily target various architectures.

Most compilers convert the program source code into a machine independent internal representation (e.g. tree) and then apply template matching, macro expansion and other stages of compilation. This separates the optimization and compiler backend from the details of the programming language. The internal representation of program code as a tree (abstract syntax tree or AST) consists of -
- a data flow graph representing data operations
- a control flow graph representing program control flow
- each tree represents at most one block of code
- a block of code may be represented by more than one tree

Macro expansion needs to understand the number and type of available registers for a target machine, the movement of values in and out of these registers i.e. for a program source code instruction which involves adding two variables and storing the result in a third variable, is first converted into an abstract syntax tree, then the macros expand one node at a time in the tree, however they need to ensure that the appropriate variables are loaded into registers, the previous values are saved back to memory, finally the result is not overwritten by the macro for subsequent source code instruction.

Compiler writers have further designed two types of macros - machine independent and object machine i.e. for each AST node template match there is atleast one machine independent macro which converts the instruction into another form and one or more object machine macro which converts this to assembly and manages the target hardware specific details. The task of designing macros became complicated as target machines evolved and programming languages improved. Further research in this area led to development of Machine Description syntax which allowed defining the hardware resources of the target architecture and this could be parsed to generate machine specific macros.

Byte Code

An area of research in compilers led to an idea of generating machine independent instructions from the program source code. This is referred to as byte code and is executed by a runtime environment. They byte code remains the same for different target architectures, however the runtime environment changes for each target machine. This allows to compile once and run on different machines. The byte code is interpreted line by line, by the runtime environment and is thus not most optimized manner for running the code. Frequently used byte code can be compiled and not have to be interpreted every time.

Peephole Optimization

After assembly code has been generated, a round of optimization is performed which targets two adjacent instructions and attempts to replace them with a single complex instruction. The intent is to improve performance by reducing pairs of simple instructions. Usually these optimizers are capable of targeting two or three adjacent instructions with constraints that they must be within the same block i.e. not across block boundaries. The small window of optimization is the reason its named 'peephole'.

The Machine Description has been extended to additionally describe Register Transfers which is the observable effect each instruction has on the target machine's registers. The peephole optimizer performs two passes of the generated assembly code, first pass is backwards which determines register transfers for each instruction and removes those instructions which have no impact on the program behavior. The second pass combines register transfers of adjacent instructions and attempts to find a single instructino which would have the same affect.

Peephole optimization are static, they recognize and optimze assembly code for a fixed set of patterns.

Recap of the flow

Program source code -> Frontend converts to Internal Representation/AST -. Instruction Selection Macro Expansion -> Assembly code with RTL -> Peephole Optimization