!! LSU EE 4720 -- Fall 2002 -- Computer Architecture
!
!! Compiler Optimization
!! Under Construction
!
! Time-stamp: <9 September 2002, 9:31:47 CDT, koppel@drop>
!! Contents
!
! Optimization Introduction
! Steps in Preparing a Program
! Compiler Definitions
! High-Level Optimizations
! Low-Level Optimizations
! Compiler Optimization Options
!! References
!
! :HP3: Hennessy & Patterson, "Computer Architecture, a Quantitative Approach"
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Optimization Introduction
! :HP3: Section 2.11
! :Def: Optimization
! The optional steps in compiling a program that reduce the program's
! execution time, reduce the size of the program, or both.
! :Example:
!
! Data on a program that computes pi with and without optimization.
!
! System: SunOS sol 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-Enterprise
! Compiler: Sun WorkShop 6 2000/04/07 C 5.1
! Clock Frequency: 300 MHz
!
! Without Optimization
! Size : 6408 bytes
! Run Time : 3.00 s
! Instruction Count : 325,338,749
! CPI : (/ (* 3.00 300e6) 325338749.0 ) = 2.7663
!
! With Optimization
! Size : 6340 bytes Smaller! (OK, only a tiny bit smaller.)
! Run Time : 1.65 s Faster!
! Instruction Count : 100,338,751
! CPI : (/ (* 1.65 300e6) 100338751.0) = 4.9333
!
!! Reasons *not* to Optimize
!
! Percentage shows roughly how often reason applies.
!
! 95% It makes debugging difficult, so don't optimize while debugging.
! This is true for almost everyone that uses a debugger.
!
! 10% It slows down compilation.
! Only important when there is a very large amount of code to
! recompile. (Back in the 20th century when computers were slow
! this was important.)
!
! 1% Optimization Introduces Bugs
! It does, but very rarely.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Steps in Preparing a Program
! Typical Steps in Preparing (Building) a Program
!
! Pre-Process, Compile, Assemble, Link, Load
! pi.c -> pi.i -> pi.s -> pi.o -> pi
!
! These steps can all be automatically performed of the compiler
! driver. (cc, gcc, MS Visual Studio, etc.)
!
! :Sample: cc pi.c -o pi
!
! They can also be specified individually.
!
! More details appear below, compile is what we're interested in.
!! Pre-Process
! Load include files, expand macros.
! Typical pre-processor name: cpp
! Input File: pi.c
! Output File: pi.i (High-Level Language)
!! Compile
! Convert high-level language in to assembler.
! Typical compiler name: cc1 (not cc, or gcc, or Visual Studio)
! Input File: pi.i
! Output File: pi.s (Assembler)
!! Assemble
! Convert assembly language in to machine language.
! Typical assembler name: as
! Input File: pi.s
! Output File: pi.o (Object file containing machine code.)
!! Link
! Combine object files, libraries, and other code into executable.
! Typical linker name: ld
! Input File: pi.o
! Output File: pi
!! Load
! Copy executable in to memory, link with shared libraries, and start.
! Loader name: exec system call.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Definitions
! A program is compiled using a steps or /passes/.
!
! The first pass reads the pre-processed high-level source code.
!
! The last pass emits assembler code.
!
! Between the first and last passes the program is in some
! intermediate representation.
!
! The way passes are defined and organized varies greatly from
! compiler to compiler.
! :Def: Pass
! A step in compiling a program. A pass usually looks at an entire function.
! :Def: Intermediate Representation
! The form of a program (sort of a special-purpose language) used internally
! by a compiler. A compiler might use several intermediate representations.
!! Typical Passes
!
! Parse
! Convert the source code to a high-level intermediate representation (H-IR).
!
! High-Level Optimization (Optional, may be done in several passes.)
! Modify H-IR to improve performance or reduce code size.
! Reads and writes H-IR
!
! Low-Level Intermediate Representation (L-IR) Generation
! Convert H-IR to a low-level intermediate representation (L-IR).
!
! Low-Level Optimizations (Optional, may be done in several passes.)
! Modify L-IR to improve performance or reduce code size.
!
! Register Assignment
! Choose machine registers.
!
! Low-Level Optimizations (Optional, may be done in several passes.)
! Modify L-IR to improve performance or reduce code size.
!
! Code Generation
! Convert L-IR to assembly code.
!
! Pigeonhole Optimizations (Optional, may be done in several passes.)
! Modify L-IR to improve performance or reduce code size.
! Some of these can be done at link time.
! :Def: Compiler Front End
! The parts of the compiler that do the parsing and high-level
! optimization passes.
! :Def: Compiler Back End
! The parts of the compiler that do the low-level optimization,
! register assignment, and code generation passes.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! High-Level Optimizations
! Easy high-level optimizations presented here.
!! Some Easy-To-Explain Front-End Optimizations
!
! Dead-Code Elimination
! Common Subexpression Elimination
! Constant Propagation, Folding
!! Dead-Code Elimination (DCE)
! Removal of code which isn't used.
! Yes, it happens.
! :Example:
!
! Code benefiting from DCE
! High-level code shown for clarity. Most compilers will transform
! an intermediate representation.
!
! Before:
!
int
main(int argv, char **argc)
{
double i; double sum = 0;
for(i=1; i<50000000;)
{
sum = sum + 4.0 / i;
i += 2;
sum = sum - 4.0 / i; i += 2;
}
printf("Performed %d iterations. Thank you for running me.\n",i);
}
!
! After:
!
int
main(int argv, char **argc)
{
double i;
for(i=1; i<50000000;)
{
i += 2;
i += 2;
}
printf("Performed %d iterations. Thank you for running me.\n",i);
}
!
! Note: Other optimizations would leave only the printf.
!! Common Subexpression Elimination
! Remove duplicated code.
!
! Before:
r = ( a + b ) / ( x + y );
s = ( a + b ) / ( x - y );
! After:
temp = a + b;
r = ( temp ) / ( x + y );
s = ( temp ) / ( x - y );
!! Constant Propagation, Folding
! The compiler performs whatever arithmetic it can at compile time
! rather than emitting code to perform the arithmetic at run time.
!
! Before:
int sec_per_day = 60 * 60 * 24;
int sec_per_year = sec_per_day * 365;
some_routine(sec_per_day * x, sec_per_year * y);
! After:
int sec_per_day = 86400;
int sec_per_year = 31536000;
some_routine(86400 * x, 31536000 * y);
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Low-Level Optimizations
! Some Low-Level Optimizations
!
! This is not a complete list.
!
! Register Assignment
! Instruction Selection
! Scheduling
!
! Additional low-level optimizations may be covered later in the semester.
! :Def: Register Assignment
! Selection of which values will be held in registers. Values
! not held in registers are stored in memory.
!
! Without Optimization
! All values corresponding to variables (in high-level program) are
! written to memory (not held in registers). Intermediate results are
! held in registers.
!
! With Optimization
! Registers are assigned to as many variables as possible, with priority
! given to frequently used variables.
!
! Advantage of Optimization
! Additional instructions are needed to read from and write to memory.
! :Example:
!
! pi program without and with optimization.
!
! Optimizations include register assignment and scheduling.
!
! Without Optimization
!
! 10 {
! 11 sum = sum + 4.0 / i; i += 2;
ldd [%fp-24],%f6 ! f6 = sum
ldd [%l0+0],%f4 ! f4 = 4.0
ldd [%fp-16],%f2 ! f2 = i
fdivd %f4,%f2,%f2 ! temp1 = 4.0 / i
faddd %f6,%f2,%f2 ! sum = sum + temp1
std %f2,[%fp-24]
ldd [%fp-16],%f4 ! f4 = i
ldd [%l0+8],%f2 ! f2 = 2.0
faddd %f4,%f2,%f2 ! i = i + 2.0
std %f2,[%fp-16]
! 12 sum = sum - 4.0 / i; i += 2;
ldd [%fp-24],%f6
ldd [%l0+0],%f4
ldd [%fp-16],%f2
fdivd %f4,%f2,%f2
fsubd %f6,%f2,%f2
std %f2,[%fp-24]
ldd [%fp-16],%f4
ldd [%l0+8],%f2
faddd %f4,%f2,%f2
std %f2,[%fp-16]
ldd [%fp-16],%f4
ldd [%l0-8],%f2
fcmped %f4,%f2
nop
fbl .L92
nop
!
! With Optimization
!
! Register assignment has eliminated the loading and storing of
! values from memory.
! (For example, register %f6 holding the value of sum in the
! un-optimized version above.)
!
! Scheduling has changed the order of the instructions.
!
! 10 ! {
! 11 ! sum = sum + 4.0 / i; i += 2;
!
.L77000016:
/* 0x0020 11 */ fdivd %f4,%f30,%f6 ! temp1 = 4.0 / i_2
/* 0x0024 */ faddd %f30,%f2,%f8 ! i_1 = i_2 + 2.0
! 12 ! sum = sum - 4.0 / i; i += 2;
/* 0x0028 12 */ faddd %f8,%f2,%f30 ! i_2 = i_1 + 2.0
/* 0x002c */ fcmped %f30,%f0 ! i_2 < 50000000
/* 0x0030 */ fdivd %f4,%f8,%f8 ! temp2 = 4.0 / i_1
/* 0x0034 11 */ faddd %f10,%f6,%f6 ! sum = sum + temp1
/* 0x0038 12 */ fbl .L77000016
/* 0x003c */ fsubd %f6,%f8,%f10 ! sum = sum - temp2
! 13 ! }
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Optimization Options
! Under Construction
! Optimization options tell the compiler:
!
! How much effort to put into optimization. E.g., -O2
! Which particular optimizations to perform. E.g., -fno-strength-reduce
! What kind of system the code will be running on. E.g., -xultra2
!! Optimization Effort
!
! Most compilers have optimization levels.
!
! The higher the number, the more optimizations done.
! :Example:
!
! Optimization Levels for Sun Workshop 6 Compiler
! -xO [1|2|3|4|5]
! Optimizes the object code. Note the upper-case letter
! O.
! The levels (1, 2, 3, 4, or 5) you can use differ
! according to the platform you are using.
! ( SPARC)
! -xO1 Does basic local optimization (peephole).
! -xO2 Does basic local and global optimization.
! This is induction variable elimination, local
! and global common subexpression elimination,
! algebraic simplification, copy propagation,
! constant propagation, loop-invariant optimi-
! zation, register allocation, basic block
! merging, tail recursion elimination, dead
! code elimination, tail call elimination and
! complex expression expansion.
! The -xO2 level does not assign global, exter-
! nal, or indirect references or definitions to
! registers. It treats these references and
! definitions as if they were declared "vola-
! tile." In general, the -xO2 level results in
! minimum code size.
! -xO3 Performs like -xO2 but, also optimizes refer-
! ences or definitions for external variables.
! Loop unrolling and software pipelining are
! also performed. The -xO3 level does not trace
! the effects of pointer assignments. When com-
! piling either device drivers, or programs
! that modify external variables from within
! signal handlers, you may need to use the
! volatile type qualifier to protect the object
! from optimization. In general, the -xO3
! level results in increased code size.
! -xO4 Performs like -xO3 but, also does automatic
! inlining of functions contained in the same
! file; this usually improves execution speed.
! The -xO4 level does trace the effects of
! pointer assignments. In general, the -xO4
! level results in increased code size.
! If you want to control which functions are
! inlined, see -xinline .
! -xO5 Generates the highest level of optimization.
! Uses optimization algorithms that take more
! compilation time or that do not have as high
! a certainty of improving execution time.
! Optimization at this level is more likely to
! improve performance if it is done with pro-
! file feedback.
!! Particular Optimizations
!
! The levels specify sets of optimizations (like ordering the "Sport Package"
! for a new car).
!
! In contrast to optimization levels (-O3), the compiler can be told which
! particular optimizations to make.
!
! :Example:
!
! Some optimization options for gcc.
! `-ffast-math'
! This option allows GCC to violate some ANSI or IEEE rules and/or
! specifications in the interest of optimizing code for speed. For
! example, it allows the compiler to assume arguments to the `sqrt'
! function are non-negative numbers and that no floating-point values
! are NaNs.
! `-frerun-loop-opt'
! Run the loop optimizer twice.
!! Targets
!
! Specify exact type of machine code will be run on.
!
! What might be specified:
!
! ISA
! The exact instruction set used.
! Specifies not just the ISA family, but a particular variation.
! A poor choice will limit the number of machines code can run on.
!
! Implementation
! Specify the implementation code will run on.
! A poor choice will result in slower code.
!
! Cache
! Specify configuration of cache.
! Caches covered later in the semester.
! A poor choice will result in slower code.
! :Example:
!
! Switches for Sun Workshop 6 Compiler.
!
! -xarch Specifies ISA
! -xchip Specifies implementation.
! -xcache Specifies cache.
!
! The following is from the Sun Workshop 6 Compiler manual.
!
!
! -xarch Values
!
! generic Set the parameters for the best performance
! over most 32-bit platform architectures.
!
! v7 Compile for the SPARC-V7 ISA.
! Enables the compiler to generate code for
! good performance on the V7 ISA. This is
! equivalent to using the best instruction set
! for good performance on the V8 ISA, but with-
! out integer mul and div instructions, and the
! fsmuld instruction.
! v8a Compile for the V8a version of the SPARC-V8
! ISA.
! By definition, V8a means the V8 ISA, but
! without the fsmuld instruction. This option
! enables the compiler to generate code for
! good performance on the V8a ISA.
! Example: Any system based on the microSPARC I
! chip architecture
! v8 Compile for the SPARC-V8 ISA.
! Enables the compiler to generate code for
! good performance on the V8 architecture.
! Example: SPARCstation 10
! -xchip=c
! Specifies the target processor for use by the optim-
! izer.
!
! This option specifies timing properties by specifying
! the target processor.
! Some effects are:
! o The ordering of instructions, that is, scheduling
! o The way the compiler uses branches
! o The instructions to use in cases where semantically
! equivalent alternatives are available
! The -xchip values are:
! generic
! Set the parameters for the best performance over
! most 32-bit platform architectures.
! generic64
! Set the parameters for the best performance over
! most 64-bit platform architectures.
! native Set the parameters for the best performance on
! the host environment (assumes a 32-bit architec-
! ture).
! native64
! Set the parameters for the best performance on
! the host environment (assumes a 64-bit architec-
! ture).
! old Optimize for pre-SuperSPARC(TM) processors.
! super Optimize for the SuperSPARC chip.
! super2 Optimize for the SuperSPARC II(TM) chip.
! micro Optimize for the microSPARC(TM) chip.
! micro2 Optimize for the microSPARC II(TM) chip.
! hyper Optimize for the hyperSPARC(TM) chip.
! hyper2 Optimize for the hyperSPARC II(TM) chip.
! powerup
! Optimize for the Weitek(R) PowerUp(TM) chip.
! ultra Optimize for the UltraSPARC(TM) chip.
! ultra2 Optimize for the UltraSPARC II(TM) chip.
! ultra2i
! Optimize for the UltraSPARC IIi(TM) chip.
! 386 Optimize for the Intel 386 architecture.
! 486 Optimize for the Intel 486 architecture.
! pentium
! Optimize for the Intel pentium architecture.
! pentium_pro
! Optimize for the Intel pentium_pro architecture.