!! LSU EE 4720 -- Fall 2002 -- Computer Architecture ! !! Compiler Optimization !! Under Construction ! ! Time-stamp: <9 September 2002, 9:31:47 CDT, koppel@drop> !! Contents ! ! Optimization Introduction ! Steps in Preparing a Program ! Compiler Definitions ! High-Level Optimizations ! Low-Level Optimizations ! Compiler Optimization Options !! References ! ! :HP3: Hennessy & Patterson, "Computer Architecture, a Quantitative Approach" !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !! Optimization Introduction ! :HP3: Section 2.11 ! :Def: Optimization ! The optional steps in compiling a program that reduce the program's ! execution time, reduce the size of the program, or both. ! :Example: ! ! Data on a program that computes pi with and without optimization. ! ! System: SunOS sol 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-Enterprise ! Compiler: Sun WorkShop 6 2000/04/07 C 5.1 ! Clock Frequency: 300 MHz ! ! Without Optimization ! Size : 6408 bytes ! Run Time : 3.00 s ! Instruction Count : 325,338,749 ! CPI : (/ (* 3.00 300e6) 325338749.0 ) = 2.7663 ! ! With Optimization ! Size : 6340 bytes Smaller! (OK, only a tiny bit smaller.) ! Run Time : 1.65 s Faster! ! Instruction Count : 100,338,751 ! CPI : (/ (* 1.65 300e6) 100338751.0) = 4.9333 ! !! Reasons *not* to Optimize ! ! Percentage shows roughly how often reason applies. ! ! 95% It makes debugging difficult, so don't optimize while debugging. ! This is true for almost everyone that uses a debugger. ! ! 10% It slows down compilation. ! Only important when there is a very large amount of code to ! recompile. (Back in the 20th century when computers were slow ! this was important.) ! ! 1% Optimization Introduces Bugs ! It does, but very rarely. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !! Steps in Preparing a Program ! Typical Steps in Preparing (Building) a Program ! ! Pre-Process, Compile, Assemble, Link, Load ! pi.c -> pi.i -> pi.s -> pi.o -> pi ! ! These steps can all be automatically performed of the compiler ! driver. (cc, gcc, MS Visual Studio, etc.) ! ! :Sample: cc pi.c -o pi ! ! They can also be specified individually. ! ! More details appear below, compile is what we're interested in. !! Pre-Process ! Load include files, expand macros. ! Typical pre-processor name: cpp ! Input File: pi.c ! Output File: pi.i (High-Level Language) !! Compile ! Convert high-level language in to assembler. ! Typical compiler name: cc1 (not cc, or gcc, or Visual Studio) ! Input File: pi.i ! Output File: pi.s (Assembler) !! Assemble ! Convert assembly language in to machine language. ! Typical assembler name: as ! Input File: pi.s ! Output File: pi.o (Object file containing machine code.) !! Link ! Combine object files, libraries, and other code into executable. ! Typical linker name: ld ! Input File: pi.o ! Output File: pi !! Load ! Copy executable in to memory, link with shared libraries, and start. ! Loader name: exec system call. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !! Compiler Definitions ! A program is compiled using a steps or /passes/. ! ! The first pass reads the pre-processed high-level source code. ! ! The last pass emits assembler code. ! ! Between the first and last passes the program is in some ! intermediate representation. ! ! The way passes are defined and organized varies greatly from ! compiler to compiler. ! :Def: Pass ! A step in compiling a program. A pass usually looks at an entire function. ! :Def: Intermediate Representation ! The form of a program (sort of a special-purpose language) used internally ! by a compiler. A compiler might use several intermediate representations. !! Typical Passes ! ! Parse ! Convert the source code to a high-level intermediate representation (H-IR). ! ! High-Level Optimization (Optional, may be done in several passes.) ! Modify H-IR to improve performance or reduce code size. ! Reads and writes H-IR ! ! Low-Level Intermediate Representation (L-IR) Generation ! Convert H-IR to a low-level intermediate representation (L-IR). ! ! Low-Level Optimizations (Optional, may be done in several passes.) ! Modify L-IR to improve performance or reduce code size. ! ! Register Assignment ! Choose machine registers. ! ! Low-Level Optimizations (Optional, may be done in several passes.) ! Modify L-IR to improve performance or reduce code size. ! ! Code Generation ! Convert L-IR to assembly code. ! ! Pigeonhole Optimizations (Optional, may be done in several passes.) ! Modify L-IR to improve performance or reduce code size. ! Some of these can be done at link time. ! :Def: Compiler Front End ! The parts of the compiler that do the parsing and high-level ! optimization passes. ! :Def: Compiler Back End ! The parts of the compiler that do the low-level optimization, ! register assignment, and code generation passes. !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !! High-Level Optimizations ! Easy high-level optimizations presented here. !! Some Easy-To-Explain Front-End Optimizations ! ! Dead-Code Elimination ! Common Subexpression Elimination ! Constant Propagation, Folding !! Dead-Code Elimination (DCE) ! Removal of code which isn't used. ! Yes, it happens. ! :Example: ! ! Code benefiting from DCE ! High-level code shown for clarity. Most compilers will transform ! an intermediate representation. ! ! Before: ! int main(int argv, char **argc) { double i; double sum = 0; for(i=1; i<50000000;) { sum = sum + 4.0 / i; i += 2; sum = sum - 4.0 / i; i += 2; } printf("Performed %d iterations. Thank you for running me.\n",i); } ! ! After: ! int main(int argv, char **argc) { double i; for(i=1; i<50000000;) { i += 2; i += 2; } printf("Performed %d iterations. Thank you for running me.\n",i); } ! ! Note: Other optimizations would leave only the printf. !! Common Subexpression Elimination ! Remove duplicated code. ! ! Before: r = ( a + b ) / ( x + y ); s = ( a + b ) / ( x - y ); ! After: temp = a + b; r = ( temp ) / ( x + y ); s = ( temp ) / ( x - y ); !! Constant Propagation, Folding ! The compiler performs whatever arithmetic it can at compile time ! rather than emitting code to perform the arithmetic at run time. ! ! Before: int sec_per_day = 60 * 60 * 24; int sec_per_year = sec_per_day * 365; some_routine(sec_per_day * x, sec_per_year * y); ! After: int sec_per_day = 86400; int sec_per_year = 31536000; some_routine(86400 * x, 31536000 * y); !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !! Low-Level Optimizations ! Some Low-Level Optimizations ! ! This is not a complete list. ! ! Register Assignment ! Instruction Selection ! Scheduling ! ! Additional low-level optimizations may be covered later in the semester. ! :Def: Register Assignment ! Selection of which values will be held in registers. Values ! not held in registers are stored in memory. ! ! Without Optimization ! All values corresponding to variables (in high-level program) are ! written to memory (not held in registers). Intermediate results are ! held in registers. ! ! With Optimization ! Registers are assigned to as many variables as possible, with priority ! given to frequently used variables. ! ! Advantage of Optimization ! Additional instructions are needed to read from and write to memory. ! :Example: ! ! pi program without and with optimization. ! ! Optimizations include register assignment and scheduling. ! ! Without Optimization ! ! 10 { ! 11 sum = sum + 4.0 / i; i += 2; ldd [%fp-24],%f6 ! f6 = sum ldd [%l0+0],%f4 ! f4 = 4.0 ldd [%fp-16],%f2 ! f2 = i fdivd %f4,%f2,%f2 ! temp1 = 4.0 / i faddd %f6,%f2,%f2 ! sum = sum + temp1 std %f2,[%fp-24] ldd [%fp-16],%f4 ! f4 = i ldd [%l0+8],%f2 ! f2 = 2.0 faddd %f4,%f2,%f2 ! i = i + 2.0 std %f2,[%fp-16] ! 12 sum = sum - 4.0 / i; i += 2; ldd [%fp-24],%f6 ldd [%l0+0],%f4 ldd [%fp-16],%f2 fdivd %f4,%f2,%f2 fsubd %f6,%f2,%f2 std %f2,[%fp-24] ldd [%fp-16],%f4 ldd [%l0+8],%f2 faddd %f4,%f2,%f2 std %f2,[%fp-16] ldd [%fp-16],%f4 ldd [%l0-8],%f2 fcmped %f4,%f2 nop fbl .L92 nop ! ! With Optimization ! ! Register assignment has eliminated the loading and storing of ! values from memory. ! (For example, register %f6 holding the value of sum in the ! un-optimized version above.) ! ! Scheduling has changed the order of the instructions. ! ! 10 ! { ! 11 ! sum = sum + 4.0 / i; i += 2; ! .L77000016: /* 0x0020 11 */ fdivd %f4,%f30,%f6 ! temp1 = 4.0 / i_2 /* 0x0024 */ faddd %f30,%f2,%f8 ! i_1 = i_2 + 2.0 ! 12 ! sum = sum - 4.0 / i; i += 2; /* 0x0028 12 */ faddd %f8,%f2,%f30 ! i_2 = i_1 + 2.0 /* 0x002c */ fcmped %f30,%f0 ! i_2 < 50000000 /* 0x0030 */ fdivd %f4,%f8,%f8 ! temp2 = 4.0 / i_1 /* 0x0034 11 */ faddd %f10,%f6,%f6 ! sum = sum + temp1 /* 0x0038 12 */ fbl .L77000016 /* 0x003c */ fsubd %f6,%f8,%f10 ! sum = sum - temp2 ! 13 ! } !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! !! Compiler Optimization Options ! Under Construction ! Optimization options tell the compiler: ! ! How much effort to put into optimization. E.g., -O2 ! Which particular optimizations to perform. E.g., -fno-strength-reduce ! What kind of system the code will be running on. E.g., -xultra2 !! Optimization Effort ! ! Most compilers have optimization levels. ! ! The higher the number, the more optimizations done. ! :Example: ! ! Optimization Levels for Sun Workshop 6 Compiler ! -xO [1|2|3|4|5] ! Optimizes the object code. Note the upper-case letter ! O. ! The levels (1, 2, 3, 4, or 5) you can use differ ! according to the platform you are using. ! ( SPARC) ! -xO1 Does basic local optimization (peephole). ! -xO2 Does basic local and global optimization. ! This is induction variable elimination, local ! and global common subexpression elimination, ! algebraic simplification, copy propagation, ! constant propagation, loop-invariant optimi- ! zation, register allocation, basic block ! merging, tail recursion elimination, dead ! code elimination, tail call elimination and ! complex expression expansion. ! The -xO2 level does not assign global, exter- ! nal, or indirect references or definitions to ! registers. It treats these references and ! definitions as if they were declared "vola- ! tile." In general, the -xO2 level results in ! minimum code size. ! -xO3 Performs like -xO2 but, also optimizes refer- ! ences or definitions for external variables. ! Loop unrolling and software pipelining are ! also performed. The -xO3 level does not trace ! the effects of pointer assignments. When com- ! piling either device drivers, or programs ! that modify external variables from within ! signal handlers, you may need to use the ! volatile type qualifier to protect the object ! from optimization. In general, the -xO3 ! level results in increased code size. ! -xO4 Performs like -xO3 but, also does automatic ! inlining of functions contained in the same ! file; this usually improves execution speed. ! The -xO4 level does trace the effects of ! pointer assignments. In general, the -xO4 ! level results in increased code size. ! If you want to control which functions are ! inlined, see -xinline . ! -xO5 Generates the highest level of optimization. ! Uses optimization algorithms that take more ! compilation time or that do not have as high ! a certainty of improving execution time. ! Optimization at this level is more likely to ! improve performance if it is done with pro- ! file feedback. !! Particular Optimizations ! ! The levels specify sets of optimizations (like ordering the "Sport Package" ! for a new car). ! ! In contrast to optimization levels (-O3), the compiler can be told which ! particular optimizations to make. ! ! :Example: ! ! Some optimization options for gcc. ! `-ffast-math' ! This option allows GCC to violate some ANSI or IEEE rules and/or ! specifications in the interest of optimizing code for speed. For ! example, it allows the compiler to assume arguments to the `sqrt' ! function are non-negative numbers and that no floating-point values ! are NaNs. ! `-frerun-loop-opt' ! Run the loop optimizer twice. !! Targets ! ! Specify exact type of machine code will be run on. ! ! What might be specified: ! ! ISA ! The exact instruction set used. ! Specifies not just the ISA family, but a particular variation. ! A poor choice will limit the number of machines code can run on. ! ! Implementation ! Specify the implementation code will run on. ! A poor choice will result in slower code. ! ! Cache ! Specify configuration of cache. ! Caches covered later in the semester. ! A poor choice will result in slower code. ! :Example: ! ! Switches for Sun Workshop 6 Compiler. ! ! -xarch Specifies ISA ! -xchip Specifies implementation. ! -xcache Specifies cache. ! ! The following is from the Sun Workshop 6 Compiler manual. ! ! ! -xarch Values ! ! generic Set the parameters for the best performance ! over most 32-bit platform architectures. ! ! v7 Compile for the SPARC-V7 ISA. ! Enables the compiler to generate code for ! good performance on the V7 ISA. This is ! equivalent to using the best instruction set ! for good performance on the V8 ISA, but with- ! out integer mul and div instructions, and the ! fsmuld instruction. ! v8a Compile for the V8a version of the SPARC-V8 ! ISA. ! By definition, V8a means the V8 ISA, but ! without the fsmuld instruction. This option ! enables the compiler to generate code for ! good performance on the V8a ISA. ! Example: Any system based on the microSPARC I ! chip architecture ! v8 Compile for the SPARC-V8 ISA. ! Enables the compiler to generate code for ! good performance on the V8 architecture. ! Example: SPARCstation 10 ! -xchip=c ! Specifies the target processor for use by the optim- ! izer. ! ! This option specifies timing properties by specifying ! the target processor. ! Some effects are: ! o The ordering of instructions, that is, scheduling ! o The way the compiler uses branches ! o The instructions to use in cases where semantically ! equivalent alternatives are available ! The -xchip values are: ! generic ! Set the parameters for the best performance over ! most 32-bit platform architectures. ! generic64 ! Set the parameters for the best performance over ! most 64-bit platform architectures. ! native Set the parameters for the best performance on ! the host environment (assumes a 32-bit architec- ! ture). ! native64 ! Set the parameters for the best performance on ! the host environment (assumes a 64-bit architec- ! ture). ! old Optimize for pre-SuperSPARC(TM) processors. ! super Optimize for the SuperSPARC chip. ! super2 Optimize for the SuperSPARC II(TM) chip. ! micro Optimize for the microSPARC(TM) chip. ! micro2 Optimize for the microSPARC II(TM) chip. ! hyper Optimize for the hyperSPARC(TM) chip. ! hyper2 Optimize for the hyperSPARC II(TM) chip. ! powerup ! Optimize for the Weitek(R) PowerUp(TM) chip. ! ultra Optimize for the UltraSPARC(TM) chip. ! ultra2 Optimize for the UltraSPARC II(TM) chip. ! ultra2i ! Optimize for the UltraSPARC IIi(TM) chip. ! 386 Optimize for the Intel 386 architecture. ! 486 Optimize for the Intel 486 architecture. ! pentium ! Optimize for the Intel pentium architecture. ! pentium_pro ! Optimize for the Intel pentium_pro architecture.