!! LSU EE 4720 -- Fall 2002 -- Computer Architecture
!
!! Compiler Optimization

!! Under Construction
!
! Time-stamp: <9 September 2002, 9:31:47 CDT, koppel@drop>

!! Contents
!
! Optimization Introduction
! Steps in Preparing a Program
! Compiler Definitions
! High-Level Optimizations
! Low-Level Optimizations
! Compiler Optimization Options

!! References
!
! :HP3: Hennessy & Patterson, "Computer Architecture, a Quantitative Approach"

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Optimization Introduction

! :HP3: Section 2.11

! :Def: Optimization
! The optional steps in compiling a program that reduce the program's
! execution time, reduce the size of the program, or both.

! :Example:
!
! Data on a program that computes pi with and without optimization.
!
! System:   SunOS sol 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-Enterprise
! Compiler: Sun WorkShop 6 2000/04/07 C 5.1
! Clock Frequency: 300 MHz
!
! Without Optimization
!   Size              : 6408 bytes
!   Run Time          : 3.00 s
!   Instruction Count : 325,338,749
!   CPI               : (/ (* 3.00 300e6) 325338749.0 ) = 2.7663
!
! With Optimization
!   Size              : 6340 bytes    Smaller! (OK, only a tiny bit smaller.)
!   Run Time          : 1.65 s        Faster!
!   Instruction Count : 100,338,751
!   CPI               : (/ (* 1.65 300e6) 100338751.0) = 4.9333
!

 !! Reasons *not* to Optimize
!
! Percentage shows roughly how often reason applies.
!
! 95%  It makes debugging difficult, so don't optimize while debugging.
!      This is true for almost everyone that uses a debugger.
!
! 10%  It slows down compilation.
!      Only important when there is a very large amount of code to
!      recompile.  (Back in the 20th century when computers were slow
!      this was important.)
!
!  1%  Optimization Introduces Bugs
!      It does, but very rarely.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Steps in Preparing a Program

! Typical Steps in Preparing (Building) a Program
!
!      Pre-Process, Compile, Assemble, Link, Load
! pi.c ->     pi.i  -> pi.s  ->  pi.o  -> pi
!
! These steps can all be automatically performed of the compiler
! driver. (cc, gcc, MS Visual Studio, etc.)
!
! :Sample: cc pi.c -o pi
!
! They can also be specified individually.
!
! More details appear below, compile is what we're interested in.

 !! Pre-Process
!   Load include files, expand macros.
!   Typical pre-processor name: cpp
!   Input File: pi.c
!   Output File: pi.i  (High-Level Language)

 !! Compile
!   Convert high-level language in to assembler.
!   Typical compiler name: cc1 (not cc, or gcc, or Visual Studio)
!   Input File: pi.i
!   Output File: pi.s  (Assembler)

 !! Assemble
!   Convert assembly language in to machine language.
!   Typical assembler name: as
!   Input File: pi.s
!   Output File: pi.o  (Object file containing machine code.)

 !! Link
!   Combine object files, libraries, and other code into executable.
!   Typical linker name: ld
!   Input File: pi.o
!   Output File: pi

 !! Load
!   Copy executable in to memory, link with shared libraries, and start.
!   Loader name: exec system call.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Definitions

! A program is compiled using a steps or /passes/.
!
!   The first pass reads the pre-processed high-level source code.
!
!   The last pass emits assembler code.
!
!   Between the first and last passes the program is in some 
!    intermediate representation.
!
!   The way passes are defined and organized varies greatly from
!   compiler to compiler.

! :Def: Pass
! A step in compiling a program.  A pass usually looks at an entire function.

! :Def: Intermediate Representation
! The form of a program (sort of a special-purpose language) used internally
! by a compiler.  A compiler might use several intermediate representations.

 !! Typical Passes
!
! Parse
!   Convert the source code to a high-level intermediate representation (H-IR).
!
! High-Level Optimization (Optional, may be done in several passes.)
!   Modify H-IR to improve performance or reduce code size.
!   Reads and writes H-IR
!
! Low-Level Intermediate Representation (L-IR) Generation
!   Convert H-IR to a low-level intermediate representation (L-IR).
!
! Low-Level Optimizations  (Optional, may be done in several passes.)
!   Modify L-IR to improve performance or reduce code size.
!
! Register Assignment
!   Choose machine registers.
! 
! Low-Level Optimizations  (Optional, may be done in several passes.)
!   Modify L-IR to improve performance or reduce code size.
!
! Code Generation
!   Convert L-IR to assembly code.
!
! Pigeonhole Optimizations  (Optional, may be done in several passes.)
!   Modify L-IR to improve performance or reduce code size.
!   Some of these can be done at link time.

! :Def: Compiler Front End
! The parts of the compiler that do the parsing and high-level
! optimization passes.

! :Def: Compiler Back End
! The parts of the compiler that do the low-level optimization,
! register assignment, and code generation passes.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! High-Level Optimizations

! Easy high-level optimizations presented here.

 !! Some Easy-To-Explain Front-End Optimizations
!
! Dead-Code Elimination
! Common Subexpression Elimination
! Constant Propagation, Folding

 !! Dead-Code Elimination (DCE)
!   Removal of code which isn't used.
!   Yes, it happens.

! :Example:
!
! Code benefiting from DCE
! High-level code shown for clarity.  Most compilers will transform
! an intermediate representation.
!
! Before:
!
int
main(int argv, char **argc)
{
  double i;   double sum = 0;

  for(i=1; i<50000000;)
    {
             sum = sum + 4.0 / i;   
      i += 2;
      sum = sum - 4.0 / i;   i += 2;
    }

  printf("Performed %d iterations.  Thank you for running me.\n",i);
}
!
! After:
!
int
main(int argv, char **argc)
{
  double i;

  for(i=1; i<50000000;)
    {
      i += 2;
      i += 2;
    }

  printf("Performed %d iterations.  Thank you for running me.\n",i);
}
!
! Note: Other optimizations would leave only the printf.

 !! Common Subexpression Elimination
!   Remove duplicated code.
!
! Before:
  r = ( a + b ) / ( x + y );
  s = ( a + b ) / ( x - y );
! After:
  temp = a + b;
  r = ( temp ) / ( x + y );
  s = ( temp ) / ( x - y );

 !! Constant Propagation, Folding
!   The compiler performs whatever arithmetic it can at compile time
!   rather than emitting code to perform the arithmetic at run time.
!
! Before:
  int sec_per_day = 60 * 60 * 24;
  int sec_per_year = sec_per_day * 365;
  some_routine(sec_per_day * x, sec_per_year * y);
! After:
  int sec_per_day = 86400;
  int sec_per_year = 31536000;
  some_routine(86400 * x, 31536000 * y);


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Low-Level Optimizations

! Some Low-Level Optimizations
!
! This is not a complete list.
!
! Register Assignment
! Instruction Selection
! Scheduling
!
! Additional low-level optimizations may be covered later in the semester.

! :Def: Register Assignment
! Selection of which values will be held in registers.  Values
! not held in registers are stored in memory.  
!
! Without Optimization
! All values corresponding to variables (in high-level program) are
! written to memory (not held in registers).  Intermediate results are
! held in registers.
!
! With Optimization
! Registers are assigned to as many variables as possible, with priority
! given to frequently used variables.
!
! Advantage of Optimization
! Additional instructions are needed to read from and write to memory.

! :Example:
!
! pi program without and with optimization.
!
! Optimizations include register assignment and scheduling.
!
! Without Optimization
!
!   10	    {
!   11	      sum = sum + 4.0 / i;   i += 2;

	ldd	[%fp-24],%f6   ! f6 = sum
	ldd	[%l0+0],%f4    ! f4 = 4.0
	ldd	[%fp-16],%f2   ! f2 = i
	fdivd	%f4,%f2,%f2    ! temp1 = 4.0 / i
	faddd	%f6,%f2,%f2    ! sum = sum + temp1 
	std	%f2,[%fp-24]   
	ldd	[%fp-16],%f4   ! f4 = i
	ldd	[%l0+8],%f2    ! f2 = 2.0
	faddd	%f4,%f2,%f2    ! i  = i + 2.0
	std	%f2,[%fp-16]

!   12	      sum = sum - 4.0 / i;   i += 2;

	ldd	[%fp-24],%f6
	ldd	[%l0+0],%f4
	ldd	[%fp-16],%f2
	fdivd	%f4,%f2,%f2
	fsubd	%f6,%f2,%f2
	std	%f2,[%fp-24]
	ldd	[%fp-16],%f4
	ldd	[%l0+8],%f2
	faddd	%f4,%f2,%f2
	std	%f2,[%fp-16]
	ldd	[%fp-16],%f4
	ldd	[%l0-8],%f2
	fcmped	%f4,%f2
	nop
	fbl	.L92
	nop
!
! With Optimization
!
! Register assignment has eliminated the loading and storing of
! values from memory.  
! (For example, register %f6 holding the value of sum in the
! un-optimized version above.)
!
! Scheduling has changed the order of the instructions.
!
!   10		      !    {
!   11		      !      sum = sum + 4.0 / i;   i += 2;
!
                       .L77000016:
/* 0x0020	  11 */		fdivd	%f4,%f30,%f6    ! temp1 = 4.0 / i_2
/* 0x0024	     */		faddd	%f30,%f2,%f8    ! i_1 = i_2 + 2.0

!   12		      !      sum = sum - 4.0 / i;   i += 2;

/* 0x0028	  12 */		faddd	%f8,%f2,%f30    ! i_2 = i_1 + 2.0
/* 0x002c	     */		fcmped	%f30,%f0        ! i_2 < 50000000 
/* 0x0030	     */		fdivd	%f4,%f8,%f8     ! temp2 = 4.0 / i_1
/* 0x0034	  11 */		faddd	%f10,%f6,%f6    ! sum = sum + temp1
/* 0x0038	  12 */		fbl	.L77000016
/* 0x003c	     */		fsubd	%f6,%f8,%f10    ! sum = sum - temp2

!   13		      !    }


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Optimization Options

! Under Construction

! Optimization options tell the compiler:
!
!   How much effort to put into optimization.  E.g., -O2 
!   Which particular optimizations to perform.  E.g., -fno-strength-reduce
!   What kind of system the code will be running on.  E.g., -xultra2

 !! Optimization Effort
!
! Most compilers have optimization levels.
!
! The higher the number, the more optimizations done.

! :Example:
!
! Optimization Levels for Sun Workshop 6 Compiler

!      -xO [1|2|3|4|5]
!           Optimizes the object code. Note the upper-case letter
!           O.

!           The levels (1, 2, 3, 4, or 5) you can use differ
!           according to the platform you are using.

!                ( SPARC)

!                -xO1 Does basic local optimization (peephole).

!                -xO2 Does basic local and global optimization.
!                     This is induction variable elimination, local
!                     and global common subexpression elimination,
!                     algebraic simplification, copy propagation,
!                     constant propagation, loop-invariant optimi-
!                     zation, register allocation, basic block
!                     merging, tail recursion elimination, dead
!                     code elimination, tail call elimination and
!                     complex expression expansion.

!                     The -xO2 level does not assign global, exter-
!                     nal, or indirect references or definitions to
!                     registers. It treats these references and
!                     definitions as if they were declared "vola-
!                     tile." In general, the -xO2 level results in
!                     minimum code size.

!                -xO3 Performs like -xO2 but, also optimizes refer-
!                     ences or definitions for external variables.
!                     Loop unrolling and software pipelining are
!                     also performed. The -xO3 level does not trace
!                     the effects of pointer assignments. When com-
!                     piling either device drivers, or programs
!                     that modify external variables from within
!                     signal handlers, you may need to use the
!                     volatile type qualifier to protect the object
!                     from optimization.  In general, the -xO3
!                     level results in increased code size.

!                -xO4 Performs like -xO3 but, also does automatic
!                     inlining of functions contained in the same
!                     file; this usually improves execution speed.
!                     The -xO4 level does trace the effects of
!                     pointer assignments.  In general, the -xO4
!                     level results in increased code size.

!                     If you want to control which functions are
!                     inlined, see -xinline .

!                -xO5 Generates the highest level of optimization.
!                     Uses optimization algorithms that take more
!                     compilation time or that do not have as high
!                     a certainty of improving execution time.
!                     Optimization at this level is more likely to
!                     improve performance if it is done with pro-
!                     file feedback.

 !! Particular Optimizations
!
! The levels specify sets of optimizations (like ordering the "Sport Package"
! for a new car).
!
! In contrast to optimization levels (-O3), the compiler can be told which
! particular optimizations to make.
!

! :Example:
!
! Some optimization options for gcc.

! `-ffast-math'
!      This option allows GCC to violate some ANSI or IEEE rules and/or
!      specifications in the interest of optimizing code for speed.  For
!      example, it allows the compiler to assume arguments to the `sqrt'
!      function are non-negative numbers and that no floating-point values
!      are NaNs.

! `-frerun-loop-opt'
!      Run the loop optimizer twice.

 !! Targets
!
! Specify exact type of machine code will be run on.
!
! What might be specified:
!
! ISA
! The exact instruction set used.
! Specifies not just the ISA family, but a particular variation.
! A poor choice will limit the number of machines code can run on.
!
! Implementation
! Specify the implementation code will run on.
! A poor choice will result in slower code.
!
! Cache
! Specify configuration of cache.
! Caches covered later in the semester.
! A poor choice will result in slower code.

! :Example:
!
! Switches for Sun Workshop 6 Compiler.
!
!     -xarch     Specifies ISA
!     -xchip	 Specifies implementation.
!     -xcache    Specifies cache.
!
! The following is from the Sun Workshop 6 Compiler manual.
!
!
! -xarch Values
!
!           generic   Set the parameters for the  best  performance
!                     over most 32-bit platform architectures.
!
!           v7        Compile for the SPARC-V7 ISA.

!                     Enables the compiler  to  generate  code  for
!                     good  performance  on  the  V7  ISA.  This is
!                     equivalent to using the best instruction  set
!                     for good performance on the V8 ISA, but with-
!                     out integer mul and div instructions, and the
!                     fsmuld instruction.

!           v8a       Compile for the V8a version of  the  SPARC-V8
!                     ISA.

!                     By definition, V8a  means  the  V8  ISA,  but
!                     without  the fsmuld instruction.  This option
!                     enables the compiler  to  generate  code  for
!                     good performance on the V8a ISA.

!                     Example: Any system based on the microSPARC I
!                     chip architecture

!           v8        Compile for the SPARC-V8 ISA.

!                     Enables the compiler  to  generate  code  for
!                     good performance on the V8 architecture.

!                     Example: SPARCstation 10

!      -xchip=c
!           Specifies the target processor for use by the optim-
!           izer.
!
!           This option specifies timing properties by specifying
!           the target processor.

!           Some effects are:

!           o The ordering of instructions, that is, scheduling

!           o The way the compiler uses branches

!           o The instructions to use in cases where semantically
!             equivalent alternatives are available

!           The -xchip values are:

!           generic
!                  Set the parameters for the best performance over
!                  most 32-bit platform architectures.

!           generic64
!                  Set the parameters for the best performance over
!                  most 64-bit platform architectures.

!           native Set the parameters for the best performance on
!                  the host environment (assumes a 32-bit architec-
!                  ture).

!           native64
!                  Set the parameters for the best performance on
!                  the host environment (assumes a 64-bit architec-
!                  ture).

!           old    Optimize for pre-SuperSPARC(TM) processors.

!           super  Optimize for the SuperSPARC chip.

!           super2 Optimize for the SuperSPARC II(TM) chip.

!           micro  Optimize for the microSPARC(TM) chip.

!           micro2 Optimize for the microSPARC II(TM) chip.

!           hyper  Optimize for the hyperSPARC(TM) chip.

!           hyper2 Optimize for the hyperSPARC II(TM) chip.

!           powerup
!                  Optimize for the Weitek(R) PowerUp(TM) chip.

!           ultra  Optimize for the UltraSPARC(TM) chip.

!           ultra2 Optimize for the UltraSPARC II(TM) chip.

!           ultra2i
!                  Optimize for the UltraSPARC IIi(TM) chip.

!           386    Optimize for the Intel 386 architecture.

!           486    Optimize for the Intel 486 architecture.

!           pentium
!                  Optimize for the Intel pentium architecture.

!           pentium_pro
!                  Optimize for the Intel pentium_pro architecture.