!! LSU EE 4720 -- Spring 2010 -- Computer Architecture
!
!! Compiler Optimization Lecture Notes
save %sp, 2, %g3
!! Under Construction
!
! Time-stamp: <8 March 2010, 10:06:04 CST, koppel@ellipse>
! CPI wo opt: (/ (* 0.85 1e9) 437720410.0) = 1.9419
! CPI: (/ (* 0.34 1e9) 100220398.0) = 3.3925
!! Contents
!
! Optimization Introduction
! Steps in Preparing a Program
! Compiler Definitions
! High-Level Optimizations
! Low-Level Optimizations
! Compiler Optimization Options
! Use of Compiler Switches
!! References
!
! :HP3: Hennessy & Patterson, "Computer Architecture, a Quantitative Approach"
!! Lecture Goals
!
! Understand the Program Building and Compilation Process
! Describe steps in program building and optimization, including
! intermediate files (assembler, object, ...) and tool names (preprocessor,
! compiler, etc.).
!
! Understand Specific Optimizations and Assumption Switches
! Describe, work example, explain benefit.
!
! Understand Profiling
! Steps, how performance is improved.
!
! Understand ISA and Implementation Options
! How programmer chooses them, how compiler uses them.
!
! Understand how Programmers Use Compilation Options
! Normal program development, high-performance programs, SPEC disclosure.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Optimization Introduction
! :HP3: Section 2.11
! :Def: Optimization
! The optional steps in compiling a program that reduce the program's
! execution time, reduce the size of the program, or both.
! Typically, the only time optimization is NOT done is when a program
! is being debugged.
!
! In most cases, the programmer sets overall optimization effort (low,
! medium, high).
!
! When performance is very important the programmer can specify which
! specific optimizations to use.
! :Example:
!
! Data on a program that computes pi with and without optimization.
!
! System: SunOS sol 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-Enterprise
! Compiler: Sun WorkShop 6 2000/04/07 C 5.1
! Clock Frequency: 300 MHz
!
! Without Optimization
! Size : 6408 bytes
! Run Time : 3.00 s
! Instruction Count : 325,338,749
! CPI : (/ (* 3.00 300e6) 325338749.0 ) = 2.7663
!
! With Optimization
! Size : 6340 bytes Smaller! (OK, only a tiny bit smaller.)
! Run Time : 1.65 s Faster!
! Instruction Count : 100,338,751
! CPI : (/ (* 1.65 300e6) 100338751.0) = 4.9333
!
!! Reasons *not* to Optimize
!
! Percentage shows roughly how often reason applies.
!
! 95% It makes debugging difficult, so don't optimize while debugging.
! This is true for almost everyone that uses a debugger.
!
! 10% It slows down compilation.
! Only important when there is a very large amount of code to
! recompile. (Back in the 20th century when computers were slow
! this was important.)
!
! 1% Optimization introduces bugs.
! It does, but very rarely.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Steps in Building a Program
! Typical Steps in Building a Program
!
! Pre-Process, Compile, Assemble, Link, Load
! pi.c -> pi.i -> pi.s -> pi.o -> pi
!
! These steps can all be automatically performed of the compiler
! driver. (cc, gcc, MS Visual Studio, etc.)
!
! :Sample: cc pi.c -o pi
!
! They can also be specified individually.
!
! More details appear below, compile is what we're interested in.
!! Pre-Process
! Load include files, expand macros.
! Typical pre-processor name: cpp
! Input File: pi.c
! Output File: pi.i (High-Level Language)
!! Compile
! Convert high-level language in to assembler.
! Typical compiler name: cc1 (not cc, or gcc, or Visual Studio)
! Input File: pi.i
! Output File: pi.s (Assembler)
!! Assemble
! Convert assembly language in to machine language.
! Typical assembler name: as
! Input File: pi.s
! Output File: pi.o (Object file containing machine code.)
!! Link
! Combine object files, libraries, and other code into executable.
! Typical linker name: ld
! Input File: pi.o
! Output File: pi
!! Load
! Copy executable in to memory, link with shared libraries, and start.
! Loader name: exec system call.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Definitions
! A program is compiled using a steps or /passes/.
!
! The first pass reads the pre-processed high-level source code.
!
! The last pass emits assembler code.
!
! Between the first and last passes the program is in some
! intermediate representation.
!
! The way passes are defined and organized varies greatly from
! compiler to compiler.
! :Def: Pass
! A step in compiling a program. A pass usually looks at an entire function.
! :Def: Intermediate Representation
! The form of a program (sort of a special-purpose language) used internally
! by a compiler. A compiler might use several intermediate representations.
!! Typical Passes
!
! Parse
! Convert the source code to a high-level intermediate representation (H-IR).
!
! High-Level Optimization
! Optional, may be done in several passes.
! Also called front-end optimization.
! Modify H-IR to improve performance or reduce code size.
! Reads and writes H-IR
!
! Low-Level Intermediate Representation (L-IR) Generation
! Convert H-IR to a low-level intermediate representation (L-IR).
!
! Low-Level Optimization (Optional, may be done in several passes.)
! Also called back-end optimization.
! Modify L-IR to improve performance or reduce code size.
!
! Register Assignment (Part of low-level optimization.)
! Choose machine registers.
!
! Code Generation
! Convert L-IR to assembly code.
!
! Pigeonhole Optimizations (Optional, may be done in several passes.)
! These are also called low-level optimizations.
! Modify L-IR to improve performance or reduce code size.
! Some of these can be done at link time.
! :Def: Compiler Front End
! The parts of the compiler that do the parsing and high-level
! optimization passes.
! :Def: Compiler Back End
! The parts of the compiler that do the low-level optimization,
! register assignment, and code generation passes.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! High-Level Optimizations
! Easy high-level optimizations presented here.
!! Some Easy-To-Explain Front-End Optimizations
!
! Dead-Code Elimination (DCE)
! Common Subexpression Elimination (CSE)
! Constant Propagation, Folding
! :Def: Dead-Code Elimination (DCE)
! Removal of code which isn't used.
! Yes, it happens.
! This can also be a low-level optimization.
! :Example:
!
! Code benefiting from DCE
! High-level code shown for clarity. Most compilers will transform
! an intermediate representation.
!
! Before:
!
int
main(int argv, char **argc)
{
double i; double sum = 0;
for(i=1; i<50000000;)
{
sum = sum + 4.0 / i; i += 2;
sum = sum - 4.0 / i; i += 2;
}
printf("Performed %d iterations. Thank you for running me.\n",i);
}
!
! After:
!
int
main(int argv, char **argc)
{
double i;
for(i=1; i<50000000;)
{
i += 2;
i += 2;
}
printf("Performed %d iterations. Thank you for running me.\n",i);
}
!
! Note: Other optimizations would leave only the printf.
! :Def: Common Subexpression Elimination (CSE)
! Remove duplicated code.
!
! Before:
r = ( a + b ) / ( x + y );
s = ( a + b ) / ( x - y );
! After:
temp = a + b;
r = ( temp ) / ( x + y );
s = ( temp ) / ( x - y );
! :Def: Constant Propagation, Folding
! The compiler performs whatever arithmetic it can at compile time
! rather than emitting code to perform the arithmetic at run time.
!
! Before:
int sec_per_day = 60 * 60 * 24;
int sec_per_year = sec_per_day * 365;
some_routine(sec_per_day * x, sec_per_year * y);
! After:
int sec_per_day = 86400;
int sec_per_year = 31536000;
some_routine(86400 * x, 31536000 * y);
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Low-Level Optimizations
! Some Low-Level Optimizations
!
! This is not a complete list.
!
! Register Assignment
! Instruction Selection
! Scheduling
! :Def: Register Assignment
! Selection of which values will be held in registers. Values
! not held in registers are stored in memory.
!
! Without Register Assignment Optimizations
! All values corresponding to variables (in high-level program) are
! written to memory (and not held in registers). Intermediate results are
! held in registers.
!
! With Register Assignment Optimization
! Registers are assigned to as many variables as possible, with priority
! given to frequently used variables.
!
! Advantage of Register Assignment Optimization
! Fewer memory writes and reads.
! :Def: Scheduling
! Re-arranging instructions to minimize the amount of time one instruction
! has to wait for another.
!
! For example, if an instruction takes a long time it will be started early
! so that other instructions will not have to wait for its result.
!
! Scheduling will be covered in more detail later in the semester.
! :Example:
!
! pi program without and with optimization.
!
! Optimizations include register assignment and scheduling.
!
! Without Optimization
!
! 10 {
! 11 sum = sum + 4.0 / i; i += 2;
ldd [%fp-24],%f6 ! f6 = sum
ldd [%l0+0],%f4 ! f4 = 4.0
ldd [%fp-16],%f2 ! f2 = i
fdivd %f4,%f2,%f2 ! f2 = 4.0 / i
faddd %f6,%f2,%f2 ! f2 = sum + (4.0/i)
std %f2,[%fp-24] ! sum = f2
ldd [%fp-16],%f4 ! f4 = i
ldd [%l0+8],%f2 ! f2 = 2.0
faddd %f4,%f2,%f2 ! f4 = i + 2.0
std %f2,[%fp-16] ! i = f4
! 12 sum = sum - 4.0 / i; i += 2;
ldd [%fp-24],%f6 ! f6 = sum
ldd [%l0+0],%f4 ! f4 = 4.0
ldd [%fp-16],%f2 ! f2 = i
fdivd %f4,%f2,%f2 ! f2 = 4.0 / i
fsubd %f6,%f2,%f2 ! f2 = sum - (4.0/i)
std %f2,[%fp-24] ! sum = f2
ldd [%fp-16],%f4 ! f4 = i
ldd [%l0+8],%f2 ! f2 = 2.0
faddd %f4,%f2,%f2 ! f2 = i + 2.0
std %f2,[%fp-16] ! i = f2
ldd [%fp-16],%f4 ! f4 = i
ldd [%l0-8],%f2 ! f2 = 50000000
fcmped %f4,%f2 ! compare i, 50000000
nop
fbl .L92 ! Branch if FP comparison less than.
nop
!
! With Optimization
!
! 10 ! {
! 11 ! sum = sum + 4.0 / i; i += 2;
!
.L77000016:
/* 0x0020 11 */ fdivd %f4,%f30,%f6 ! temp1 = 4.0 / i_2
/* 0x0024 */ faddd %f30,%f2,%f8 ! i_1 = i_2 + 2.0
! 12 ! sum = sum - 4.0 / i; i += 2;
/* 0x0028 12 */ faddd %f8,%f2,%f30 ! i_2 = i_1 + 2.0
/* 0x002c */ fcmped %f30,%f0 ! i_2 < 50000000
/* 0x0030 */ fdivd %f4,%f8,%f8 ! temp2 = 4.0 / i_1
/* 0x0034 11 */ faddd %f10,%f6,%f6 ! sum = sum + temp1
/* 0x0038 12 */ fbl .L77000016
/* 0x003c */ fsubd %f6,%f8,%f10 ! sum = sum - temp2
! 13 ! }
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Optimization Options
! Under Construction
! Optimization options tell the compiler:
!
! How much EFFORT to put into optimization. E.g., -O2
! Which PARTICULAR optimizations to perform. E.g., -fno-strength-reduce
! The TARGET system the code will be running on. E.g., -xultra2
! Whether to make certain ASSUMPTIONS about the code.
! Whether to use PROFILING data from a training run.
!! Target Options
!
! Specify exact type of machine code will be run on:f ISA, Implementation, Cache
!
! Choice is based on type of machines customers have.
!
!
!! Specifying the: ISA
!
! The exact instruction set used.
! Specifies not just the ISA family, but a particular variation.
! A poor choice will limit the number of machines code can run on.
!
!! Specifying the: Implementation (Processor Core)
!
! Specify the implementation code will run on.
! A poor choice will result in slower code.
!
!! Specifying the: Cache (Can be considered part of ISA implementation.)
!
! Specify configuration of cache.
! Caches covered later in the semester.
! A poor choice will result in slower code.
!! Background
!
! Sun SPARC Architecture
!
! Developed by Sun
!
! Major Versions: v7, v8, v9, v9+vis
!
! Some Implementations
! UltraSPARC T2 Implements v9, (On newer Sun computers.)
! UltraSPARC VII Implements v9, (On newer Fujitsu computers.)
! UltraSPARC IV Implements v9, (On newer Sun computers.)
! :Example:
!
! Switches for Sun Workshop 6 Compiler.
!
! -xarch Specifies ISA
! -xchip Specifies implementation.
! -xcache Specifies cache.
! -xtarget Specifies system. ISA/implementation and cache.
!
! -xarch Values:
! v7, v8a, v8, v8plus (v8 with 32-bit v9), v8plusa (v8 with 32 bit v9 etc.),
! v8plusb, v9, v9a, v9b
! -xchip=c
! c must be one of: generic, old, super, super2, micro,
! micro2, hyper, hyper2, powerup, ultra, ultra2, ultra2e,
! ultra2i, ultra3, ultra3cu, 386, 486, pentium,
! pentium_pro.
!! Optimization EFFORT
!
! :Def: Optimization Effort
! Amount of optimization to be done. A small effort means performing
! only easy optimizations, a large effort means performing more
! time-consuming optimizations.
!
! Most compilers have optimization levels.
!
! The higher the number, the more optimizations done.
! :Example:
!
! Optimization Levels for Sun Workshop 6 Compiler
! -xO [1|2|3|4|5]
! Optimizes the object code. Note the upper-case letter
! O.
! The levels (1, 2, 3, 4, or 5) you can use differ
! according to the platform you are using.
! ( SPARC)
! -xO1 Does basic local optimization (peephole).
! -xO2 Does basic local and global optimization.
! This is induction variable elimination, local
! and global common subexpression elimination,
! algebraic simplification, copy propagation,
! constant propagation, loop-invariant optimi-
! zation, register allocation, basic block
! merging, tail recursion elimination, dead
! code elimination, tail call elimination and
! complex expression expansion.
! The -xO2 level does not assign global, exter-
! nal, or indirect references or definitions to
! registers. It treats these references and
! definitions as if they were declared "vola-
! tile." In general, the -xO2 level results in
! minimum code size.
! -xO3 Performs like -xO2 but, also optimizes refer-
! ences or definitions for external variables.
! Loop unrolling and software pipelining are
! also performed. The -xO3 level does not trace
! the effects of pointer assignments. When com-
! piling either device drivers, or programs
! that modify external variables from within
! signal handlers, you may need to use the
! volatile type qualifier to protect the object
! from optimization. In general, the -xO3
! level results in increased code size.
! -xO4 Performs like -xO3 but, also does automatic
! inlining of functions contained in the same
! file; this usually improves execution speed.
! The -xO4 level does trace the effects of
! pointer assignments. In general, the -xO4
! level results in increased code size.
! If you want to control which functions are
! inlined, see -xinline .
! -xO5 Generates the highest level of optimization.
! Uses optimization algorithms that take more
! compilation time or that do not have as high
! a certainty of improving execution time.
! Optimization at this level is more likely to
! improve performance if it is done with pro-
! file feedback.
!! PARTICULAR Optimizations
!
! The levels specify sets of optimizations (like ordering the "Sport Package"
! for a new car).
!
! In contrast to optimization levels (-O3), the compiler can be told which
! particular optimizations to make.
!
! These are typically used by skilled programmers trying to get
! fastest code.
!
! Some examples below.
! gcc
!
! `-frerun-loop-opt'
! Run the loop optimizer twice.
! Sun Workshop 6
!
! Prefetch instructions read memory in advance, eliminating some cache
! misses (covered later in the semester). They can also increase
! cache misses or increase the time to access memory, hurting
! performance. The compiler does not know if they will help or hurt.
!
! -xprefetch[=val],val
! (SPARC) Enable prefetch instructions on those architec-
! tures that support prefetch, such as UltraSPARC II.
! (-xarch=v8plus, v9plusa, v9, or v9a)
! Explicit prefetching should only be used under special
! circumstances that are supported by measurements.
!! ASSUMPTIONS (Assertions) About the Program
! Compilers must generate correct code.
!
! That is, the code must execute in the way specified by the
! high-level language definition.
!
! Correct code can be slow.
!
! The compiled code might need to check for things that can happen ...
! ... but don't in a particular program.
!
! Some options tell the compiler to make assumptions about the program.
!
! These assumptions would not hold for every program.
!
! The compiled program runs faster ...
! ... and correctly if the assumptions are valid.
! Some switches specifying assumptions:
! gcc, Assume the program does not require strict IEEE 754 FP features.
! `-ffast-math'
! This option allows GCC to violate some ANSI or IEEE rules and/or
! specifications in the interest of optimizing code for speed. For
! example, it allows the compiler to assume arguments to the `sqrt'
! function are non-negative numbers and that no floating-point values
! are NaNs.
! cc (Sun Workshop 6 Compiler) Assume certain pointers do not overlap.
!
! -xrestrict=f
! (SPARC) Treats pointer-valued function parameters as
! restricted pointers. f is a comma-separated list that
! consists of one or more function parameters, %all,
! %none. This command-line option can be used on its
! own, but is best used with optimization of -xO3 or
! greater.
!
! The default is %none. Specifying -xrestrict is
! equivalent to specifying -xrestrict=%all.
!
! :Example:
!
! In the loop below the compiler would ordinarily load the value at x
! from memory (dereference) each iteration (five times) because the
! address of x may be the same as the address of one of the "a"
! elements. With -xrestrict switch, x is not loaded each iteration,
! saving time and space. The switch is needed because the compiler
! has no way of knowing if x and the a's overlap and must otherwise
! make a conservative assumption (that they overlap)
void array_add(int *a, int *b, int *x)
{
int i;
for(i=0; i<5; i++)
a[i] = b[i] + x[0];
}
! Compiled Code Without -xrestrict
! 24 ! int i;
! 25 ! for(i=0; i<5; i++)
! 26 ! a[i] = b[i] + *x;
! %g4 i
! %o2 x
! %g3 &a[i]
! %g2 &b[i]
/* 0x0004 26 */ ld [%o1],%o5 ! o5 = Mem[o1]
/* 0x0008 23 */ or %g0,%o1,%g2 ! g2 = 0 + o1
/* 0x000c 25 */ or %g0,0,%g4 ! g4 = 0 + 0
.L900000108: ! The loop body starts here.
/* 0x0010 26 */ ld [%o2],%o4 ! <-- Load x, notice that
/* 0x0014 */ add %g4,1,%g4 ! o2 does not change.
/* 0x0018 */ add %g2,4,%g2 ! g2 = g2 + 4
/* 0x001c */ cmp %g4,5
/* 0x0020 */ add %o5,%o4,%o5 ! o5 = b[i] + x[0]
/* 0x0024 */ st %o5,[%g3] ! a[i] = o5
/* 0x0028 */ add %g3,4,%g3
/* 0x002c */ bl,a .L900000108
/* 0x0030 */ ld [%g2],%o5 ! o5 = b[i+1]
.L77000022:
/* 0x0034 */ retl ! Result =
/* 0x0038 */ nop
! Compiled Code With -xrestrict
/* 000000 23 */ or %g0,%o0,%g4
! 24 ! int i;
! 25 ! for(i=0; i<5; i++)
! 26 ! a[i] = b[i] + *x;
! o2 x
! g2 *x
! g4 &a[i]
! g3 &b[i]
/* 0x0004 26 */ ld [%o2],%g2 ! <-- Load x before entering
/* 0x0008 23 */ or %g0,%o1,%g3 ! the loop.
/* 0x000c 26 */ ld [%o1],%o5
/* 0x0010 25 */ or %g0,0,%g1
.L900000108: ! The loop body starts here.
/* 0x0014 26 */ add %o5,%g2,%o5
/* 0x0018 */ st %o5,[%g4]
/* 0x001c */ add %g1,1,%g1
/* 0x0020 */ add %g3,4,%g3
/* 0x0024 */ add %g4,4,%g4
/* 0x0028 */ cmp %g1,5
/* 0x002c */ bl,a .L900000108
/* 0x0030 */ ld [%g3],%o5
.L77000022:
/* 0x0034 */ retl ! Result =
/* 0x0038 */ nop
! If array_add is compiled using the -xrestrict switch and it is
! called by the code below, it will return the wrong answer (since x
! changes value).
int main(int argv, char **argc)
{
int a[5], b[5];
int *x = &a[1];
... (More code here)
array_add(a,b,x);
}
! (End of example.)
!! PROFILING
! :Def: Profiling (a.k.a. Feedback-Directed Optimization)
! A compilation technique in which data taken from a sample run is used
! to guide compiler decisions.
! Typical Profiling Procedure
!
! (1) Compile program with profiling option.
! (2) Run program using typical input data.
! During run profile information collected for compiler.
! (3) Run compiler again, specifying where to find profile information.
! Branches occur frequently in code. There is a performance penalty
! in taking a branch and so it's best if the compiler organizes code
! (rearranges the intermediate representation) so that branches are
! not taken as much as possible. To do that the compiler needs to
! know how often an "if" or other condition (for which the branch is
! emitted) is true. Only in a few cases can the compiler figure that
! out on its own, because, for example, "if" conditions depend on
! input data. To obtain this useful information a two-step
! compilation process called profiling is used. In the first step the
! code is compiled so that it writes branch information (more
! precisely basic block, covered later) to a file. The program is run
! with typical input data, called the training input, and it writes
! the branch information to a file. In the second step the compiler
! reads the information and uses that to better organize the code.
! Sun Profiling Compiler Switches
!
! -xprofile=p
! Collects data for a profile or use a profile to optim-
! ize.
!
! p must be collect[:name], use[:name], or tcov.
!
! This option causes execution frequency data to be col-
! lected and saved during execution, then the data can be
! used in subsequent runs to improve performance. This
! option is only valid when a level of optimization is
! specified.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Use of Compiler Switches
!! Use of Compiler Switches
! ISA
! Type of system all uses have. (IA-32 for PCs, SPARC for Sun users, etc.)
! Users can't normally run code compiled for a different ISA.
! Implementation
! Type of system most users have.
! Other users can run code, but won't run as fast.
! Optimization
! Select medium or high optimization level.
! If market very sensitive to performance, use specific optimizations.