!! LSU EE 4720 -- Spring 2017 -- Computer Architecture
!! Compiler Optimization Lecture Notes

!! Contents
! Optimization Introduction
! Steps in Building a Program
! Compiler Terminology
! High-Level Optimizations
! Low-Level Optimizations
! Compiler Optimization Options
! Use of Compiler Switches

!! References
! :HP3: Hennessy & Patterson, "Computer Architecture, a Quantitative Approach"

!! Lecture Goals
! Understand the Program Building and Compilation Process
!  Describe steps in program building and optimization, including
!  intermediate files (assembler, object, ...) and tool names (preprocessor,
!  compiler, etc.).
! Understand Specific Optimizations and Assumption Switches
!  Describe, work example, explain benefit.
! Understand Profiling
!  Steps, how performance is improved.
! Understand ISA and Implementation Options
!  How programmer chooses them, how compiler uses them.
! Understand how Programmers Use Compilation Options
!  Normal program development, high-performance programs, SPEC disclosure.

!! Optimization Introduction

! :HP3: Section 2.11

! :Def: Optimization
! The optional steps in compiling a program that reduce the program's
! execution time, size, energy consumption, etc.

! Typically, the only time optimization is NOT done is when a program
! is being debugged.
! In most cases, the programmer sets overall optimization effort (low,
! medium, high).
! When performance is very important the programmer can specify which
! specific optimizations to use.

! :Example:
! Data on a program that computes pi with and without optimization.
! System:   SunOS sol 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-Enterprise
! Compiler: Sun WorkShop 6 2000/04/07 C 5.1
! Clock Frequency: 300 MHz
! Without Optimization
!   Size              : 6408 bytes
!   Run Time          : 3.00 s
!   Instruction Count : 325,338,749
!   CPI               : (/ (* 3.00 300e6) 325338749.0 ) = 2.7663
! With Optimization
!   Size              : 6340 bytes    Smaller! (OK, only a tiny bit smaller.)
!   Run Time          : 1.65 s        Faster!
!   Instruction Count : 100,338,751
!   CPI               : (/ (* 1.65 300e6) 100338751.0) = 4.9333
! Comparison of Un-optimized and Optimized Runs
!  Un-optimized run takes 1.82 times as long.
!  Un-optimized run executes 3.24 times as many instructions.
!  --> Execution time is not proportional to instruction count
!      We've already seen how this can be true due to stalls.
!  --> In the slower (un-optimized) version instructions seem to be
!      executed more efficiently (lower CPI).
!  Quick Explanation (See insn scheduling discussion below for code.)
!    The un-optimized version contained more easy instructions, such
!      as load and store instructions (that would hit the cache).
!    Both versions had the same number of floating-point divide instructions
!      which take a long time to execute.
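
! A minimal sketch of the CPI arithmetic used above, assuming CPI is
! computed as run time times clock frequency divided by instruction
! count.  The function name cpi and its parameter names are made up
! for this illustration.

```c
#include <assert.h>

/* Cycles per instruction: total clock cycles divided by the number of
   instructions executed.  Total cycles = run time (s) * clock (Hz). */
double cpi(double run_time_s, double clock_hz, double insn_count)
{
  assert(insn_count > 0.0);
  return run_time_s * clock_hz / insn_count;
}
```

! With the numbers above, cpi(3.00, 300e6, 325338749.0) is about 2.77
! (un-optimized) and cpi(1.65, 300e6, 100338751.0) is about 4.93
! (optimized).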

 !! Reasons *not* to Optimize
! Percentage shows roughly how often reason applies.
! 95%  It makes debugging difficult, so don't optimize while debugging.
!      This is true for almost everyone that uses a debugger.
! 10%  It slows down compilation.
!      Only important when there is a very large amount of code to
!      recompile.  (Back in the 20th century when computers were slow
!      this was important.)
!  .001%  Optimization introduces bugs.
!      It does, but very rarely.

!! Steps in Building a Program

! Typical Steps in Building a Program
!      Pre-Process, Compile, Assemble, Link, Load
! pi.c ->     pi.i  -> pi.s  ->  pi.o  -> pi
! These steps can all be automatically performed by the compiler
! driver. (cc, gcc, MS Visual Studio, etc.)
! :Sample: cc pi.c -o pi
! They can also be specified individually.
! More details appear below; the compile step is the one we're interested in.

 !! Pre-Process
!   Load include files, expand macros.
!   Typical pre-processor name: cpp
!   Input File: pi.c
!   Output File: pi.i  (High-Level Language)

 !! Compile
!   Convert high-level language into assembly language.
!   Typical compiler name: cc1 (not cc, or gcc, or Visual Studio)
!   Input File: pi.i
!   Output File: pi.s  (Assembler)

 !! Assemble
!   Convert assembly language into machine language.
!   Typical assembler name: as
!   Input File: pi.s
!   Output File: pi.o  (Object file containing machine code.)

 !! Link
!   Combine object files, libraries, and other code into executable.
!   Typical linker name: ld
!   Input File: pi.o
!   Output File: pi

 !! Load
!   Copy executable into memory, link with shared libraries, and start.
!   Loader name: exec system call.

!! Compiler Terminology

! A program is compiled in a series of steps or /passes/.
!   The first pass reads the pre-processed high-level source code.
!   The last pass emits assembler code.
!   Between the first and last passes the program is in some
!    intermediate representation.
!   The way passes are defined and organized varies greatly from
!   compiler to compiler.

! :Def: Pass
! A step in compiling a program.  A pass usually looks at an entire function.

! :Def: Intermediate Representation
! The form of a program (sort of a special-purpose language) used internally
! by a compiler.  A compiler might use several intermediate representations.

 !! Typical Passes
! Parse
!   Convert the source code to a high-level intermediate representation (H-IR).
! High-Level Optimization 
!   Optional, may be done in several passes.
!   Also called front-end optimization.
!   Modify H-IR to improve performance or reduce code size.
!   Reads and writes H-IR
! Low-Level Intermediate Representation (L-IR) Generation
!   Convert H-IR to a low-level intermediate representation (L-IR).
! Low-Level Optimization  (Optional, may be done in several passes.)
!   Also called back-end optimization.
!   Modify L-IR to improve performance or reduce code size.
! Register Assignment (Part of low-level optimization.)
!   Choose machine registers.
! Code Generation
!   Convert L-IR to assembly code.
! Peephole Optimizations  (Optional, may be done in several passes.)
!   These are also called low-level optimizations.
!   Modify L-IR to improve performance or reduce code size.
!   Some of these can be done at link time.

! :Def: Compiler Front End
! The parts of the compiler that do the parsing and high-level
! optimization passes.  Computer architects are less interested
! in this part.

! :Def: Compiler Back End
! The parts of the compiler that do the low-level optimization,
! register assignment, and code generation passes. Computer
! architects are very interested in this part.

!! High-Level Optimizations

! Easy high-level optimizations presented here.

 !! Some Easy-To-Explain Front-End Optimizations
! Dead-Code Elimination (DCE)
! Common Subexpression Elimination (CSE)
! Constant Propagation, Folding

! :Def: Dead-Code Elimination (DCE)
!   Removal of code whose results are never used.
!   Yes, it happens.
!   This can also be a low-level optimization.

! :Example:
! Code benefiting from DCE
! High-level code shown for clarity.  Most compilers will transform
! an intermediate representation.
! Before:
#include <stdio.h>

int main(int argc, char **argv)
{
  double i;   double sum = 0;

  for(i=1; i<50000000;)
    {
      sum = sum + 4.0 / i;   i += 2;   /* sum is computed but never used. */
      sum = sum - 4.0 / i;   i += 2;
    }

  printf("Performed %d iterations.  Thank you for running me.\n", (int) i);
  return 0;
}
! After:
#include <stdio.h>

int main(int argc, char **argv)
{
  double i;

  for(i=1; i<50000000;)
    {
      i += 2;
      i += 2;
    }

  printf("Performed %d iterations.  Thank you for running me.\n", (int) i);
  return 0;
}
! Note: Other optimizations would leave only the printf.

! :Def: Common Subexpression Elimination (CSE)
!   Compute a repeated expression once and reuse the result.
! Before:
  r = ( a + b ) / ( x + y );
  s = ( a + b ) / ( x - y );
! After:
  temp = a + b;
  r = ( temp ) / ( x + y );
  s = ( temp ) / ( x - y );

! :Def: Constant Propagation, Folding
!   The compiler performs whatever arithmetic it can at compile time
!   rather than emitting code to perform the arithmetic at run time.
! Before:
  int sec_per_day = 60 * 60 * 24;
  int sec_per_year = sec_per_day * 365;
  some_routine(sec_per_day * x, sec_per_year * y);
! After:
  int sec_per_day = 86400;
  int sec_per_year = 31536000;
  some_routine(86400 * x, 31536000 * y);
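
! As noted earlier, compilers usually transform an intermediate
! representation rather than source code.  The sketch below shows what a
! folding pass over a tiny three-address IR might look like.  The IR
! (struct ir_insn, OP_CONST, OP_INPUT, OP_MUL), the register limit, and
! the function name fold_constants are all invented for this
! illustration; real compilers use much richer representations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ir_op { OP_CONST, OP_INPUT, OP_MUL };

struct ir_insn {
  enum ir_op op;
  int dst;         /* destination virtual register */
  int src1, src2;  /* source registers (OP_MUL only) */
  long value;      /* constant value (OP_CONST only) */
};

#define MAX_REGS 16

/* Rewrite every OP_MUL whose operands are both known constants into an
   OP_CONST.  Assumes straight-line code in which each register is
   written exactly once, before it is read.
   Returns the number of instructions folded. */
int fold_constants(struct ir_insn *code, size_t n)
{
  bool known[MAX_REGS] = { false };
  long value[MAX_REGS] = { 0 };
  int folded = 0;

  for (size_t i = 0; i < n; i++) {
    struct ir_insn *in = &code[i];
    assert(in->dst >= 0 && in->dst < MAX_REGS);
    if (in->op == OP_MUL) {
      assert(in->src1 >= 0 && in->src1 < MAX_REGS);
      assert(in->src2 >= 0 && in->src2 < MAX_REGS);
      if (known[in->src1] && known[in->src2]) {
        /* Do the multiplication now, at "compile time". */
        in->value = value[in->src1] * value[in->src2];
        in->op = OP_CONST;
        folded++;
      }
    }
    if (in->op == OP_CONST) {
      /* Propagate: later instructions may use this constant. */
      known[in->dst] = true;
      value[in->dst] = in->value;
    }
  }
  return folded;
}
```

! Applied to an IR version of the example above, the multiplications
! producing sec_per_day and sec_per_year become constants, while the
! multiplication by x (an OP_INPUT value) is left for run time.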

!! Low-Level Optimizations

! Some Low-Level Optimizations
! This is not a complete list.
! Register Assignment
! Instruction Selection
! Scheduling

! :Def: Register Assignment
! Selection of which values will be held in registers.  Values
! not held in registers are stored in memory.
! Without Register Assignment Optimizations
! All values corresponding to variables (in high-level program) are
! written to memory (and not held in registers).  Intermediate results are
! held in registers.
! With Register Assignment Optimization
! Registers are assigned to as many variables as possible, with priority
! given to frequently used variables.
! Advantage of Register Assignment Optimization
! Fewer memory writes and reads.
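
! The difference can be pictured at the source level.  In the sketch
! below (function names are made up) the first routine keeps its
! running total in memory, as un-optimized code would, and the second
! keeps it in a local variable that a compiler can hold in a register.
! The volatile qualifier is used here only to force the per-iteration
! loads and stores; it is an assumption of this illustration, not
! something the compiler adds.

```c
#include <assert.h>
#include <stddef.h>

/* Models the un-optimized case: the total lives in memory, so every
   iteration reads it from memory and writes it back. */
void sum_in_memory(const int *a, size_t n, volatile long *total)
{
  assert(total != NULL);
  *total = 0;
  for (size_t i = 0; i < n; i++)
    *total = *total + a[i];
}

/* Models the optimized case: the total is held in a local (a register
   candidate) and memory is written only once, after the loop. */
void sum_in_register(const int *a, size_t n, long *total)
{
  assert(total != NULL);
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i];
  *total = sum;
}
```

! Both routines produce the same total; the second one simply performs
! far fewer memory accesses for the running total.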

! :Def: Scheduling
! Re-arranging instructions to minimize the amount of time one instruction
! has to wait for another.
! For example, if an instruction takes a long time it will be started early
! so that other instructions will not have to wait for its result.
! Scheduling will be covered in more detail later in the semester.
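
! The benefit of starting a long instruction early can be estimated
! with a very simple timing model.  The sketch below assumes a machine
! that issues at most one instruction per cycle, in program order, and
! stalls an instruction until its source registers are ready.  The
! structure, the function name, and the latencies used with it (10
! cycles for a divide, 2 for an add) are assumptions for illustration,
! not measurements of any real processor.  Hazards other than waiting
! for source values are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS 8

struct sched_insn {
  int dst;         /* register written */
  int src1, src2;  /* registers read */
  int latency;     /* cycles from issue until dst is ready */
};

/* Return the cycle at which the last result becomes available. */
int cycles_needed(const struct sched_insn *code, size_t n)
{
  int ready[NUM_REGS] = { 0 };  /* cycle at which each register is ready */
  int issue = 0;                /* earliest cycle the next insn can issue */
  int finish = 0;

  for (size_t i = 0; i < n; i++) {
    const struct sched_insn *in = &code[i];
    assert(in->dst >= 0 && in->dst < NUM_REGS);
    assert(in->src1 >= 0 && in->src1 < NUM_REGS);
    assert(in->src2 >= 0 && in->src2 < NUM_REGS);
    /* Stall until both source values are available. */
    if (ready[in->src1] > issue) issue = ready[in->src1];
    if (ready[in->src2] > issue) issue = ready[in->src2];
    ready[in->dst] = issue + in->latency;
    if (ready[in->dst] > finish) finish = ready[in->dst];
    issue++;                    /* next insn issues at least a cycle later */
  }
  return finish;
}
```

! For example, take two independent adds, a divide, and an add that
! uses the divide result.  With the divide placed third this model
! gives 14 cycles; with the divide moved to the front it gives 12,
! because the independent adds execute while the divide is in progress.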

! :Example:
! pi program without and with optimization.
! Optimizations include register assignment and scheduling.
! Without Optimization
!   10      {
!   11        sum = sum + 4.0 / i;   i += 2;

        ldd     [%fp-24],%f6   ! f6 = sum
        ldd     [%l0+0],%f4    ! f4 = 4.0
        ldd     [%fp-16],%f2   ! f2 = i
        fdivd   %f4,%f2,%f2    ! f2 = 4.0 / i
        faddd   %f6,%f2,%f2    ! f2 = sum + (4.0/i)
        std     %f2,[%fp-24]   ! sum = f2

        ldd     [%fp-16],%f4   ! f4 = i
        ldd     [%l0+8],%f2    ! f2 = 2.0
        faddd   %f4,%f2,%f2    ! f2 = i + 2.0
        std     %f2,[%fp-16]   ! i = f2

!   12        sum = sum - 4.0 / i;   i += 2;

        ldd     [%fp-24],%f6   ! f6 = sum
        ldd     [%l0+0],%f4    ! f4 = 4.0
        ldd     [%fp-16],%f2   ! f2 = i
        fdivd   %f4,%f2,%f2    ! f2 = 4.0 / i
        fsubd   %f6,%f2,%f2    ! f2 = sum - (4.0/i)
        std     %f2,[%fp-24]   ! sum = f2

        ldd     [%fp-16],%f4   ! f4 = i
        ldd     [%l0+8],%f2    ! f2 = 2.0
        faddd   %f4,%f2,%f2    ! f2 = i + 2.0
        std     %f2,[%fp-16]   ! i = f2

        ldd     [%fp-16],%f4   ! f4 = i
        ldd     [%l0-8],%f2    ! f2 = 50000000
        fcmped  %f4,%f2        ! compare i, 50000000
        fbl     .L92           ! Branch if FP comparison less than.
! With Optimization
!   10                !    {
!   11                !      sum = sum + 4.0 / i;   i += 2;
/* 0x0020         11 */         fdivd   %f4,%f30,%f6    ! temp1 = 4.0 / i_2
/* 0x0024            */         faddd   %f30,%f2,%f8    ! i_1 = i_2 + 2.0

!   12                !      sum = sum - 4.0 / i;   i += 2;

/* 0x0028         12 */         faddd   %f8,%f2,%f30    ! i_2 = i_1 + 2.0
/* 0x002c            */         fcmped  %f30,%f0        ! i_2 < 50000000
/* 0x0030            */         fdivd   %f4,%f8,%f8     ! temp2 = 4.0 / i_1
/* 0x0034         11 */         faddd   %f10,%f6,%f6    ! sum = sum + temp1
/* 0x0038         12 */         fbl     .L77000016
/* 0x003c            */         fsubd   %f6,%f8,%f10    ! sum = sum - temp2

!   13                !    }

!! Compiler Optimization Options

! Compiler options tell the compiler:
!   How much EFFORT to put into optimization.  E.g., -O2
!   Which PARTICULAR optimizations to perform.  E.g., -fno-strength-reduce
!   The TARGET system the code will be running on.  E.g., -xultra2
!   Whether to make certain ASSUMPTIONS about the code.
!   Whether to use PROFILING data from a training run.

 !! Target Options
! Specify the exact type of machine the code will be run on: ISA,
! Implementation, Cache.
! Choice is based on the type of machines customers have.

 !! Specifying the: ISA
! The exact instruction set used.
! Specifies not just the ISA family, but a particular variation.
! A poor choice will limit the number of machines code can run on.
 !! Specifying the: Implementation (Processor Core)
! Specify the implementation code will run on.
! A poor choice will result in slower code.
 !! Specifying the: Cache (Can be considered part of ISA implementation.)
! Specify configuration of cache.
! Caches covered later in the semester.
! A poor choice will result in slower code.

 !! Background
! ARM AArch64 Architecture
!  Developed by ARM.
! Some Implementations
!  cortex-a53
!  cortex-a72
!  exynos-m1

! :Example:
! Switches for GCC
!     -march     Specifies ISA
!     -mtune     Specifies the implementation.
!     -mcpu      Specifies both ISA and implementation

! `-mtune=NAME'
!      Specify the name of the target processor for which GCC should tune
!      the performance of the code.  Permissible values for this option
!      are: 'generic', 'cortex-a53', 'cortex-a57', 'cortex-a72',
!      'exynos-m1', 'thunderx', 'xgene1'.

 !! Optimization EFFORT
! :Def: Optimization Effort
! Amount of optimization to be done. A small effort means performing
! only easy optimizations, a large effort means performing more
! time-consuming optimizations.
! Most compilers have optimization levels.
! The higher the number, the more optimizations are done.

! :Example:
! Optimization Levels for Sun Workshop 6 Compiler

!      -xO [1|2|3|4|5]
!           Optimizes the object code. Note the upper-case letter
!           O.

!           The levels (1, 2, 3, 4, or 5) you can use differ
!           according to the platform you are using.

!                ( SPARC)

!                -xO1 Does basic local optimization (peephole).

!                -xO2 Does basic local and global optimization.
!                     This is induction variable elimination, local
!                     and global common subexpression elimination,
!                     algebraic simplification, copy propagation,
!                     constant propagation, loop-invariant
!                     optimization, register allocation, basic block
!                     merging, tail recursion elimination, dead
!                     code elimination, tail call elimination and
!                     complex expression expansion.

!                     The -xO2 level does not assign global,
!                     external, or indirect references or definitions
!                     to registers. It treats these references and
!                     definitions as if they were declared
!                     "volatile." In general, the -xO2 level results
!                     in minimum code size.

!                -xO3 Performs like -xO2 but, also optimizes
!                     references or definitions for external variables.
!                     Loop unrolling and software pipelining are
!                     also performed. The -xO3 level does not trace
!                     the effects of pointer assignments. When
!                     compiling either device drivers, or programs
!                     that modify external variables from within
!                     signal handlers, you may need to use the
!                     volatile type qualifier to protect the object
!                     from optimization.  In general, the -xO3
!                     level results in increased code size.

!                -xO4 Performs like -xO3 but, also does automatic
!                     inlining of functions contained in the same
!                     file; this usually improves execution speed.
!                     The -xO4 level does trace the effects of
!                     pointer assignments.  In general, the -xO4
!                     level results in increased code size.

!                     If you want to control which functions are
!                     inlined, see -xinline .

!                -xO5 Generates the highest level of optimization.
!                     Uses optimization algorithms that take more
!                     compilation time or that do not have as high
!                     a certainty of improving execution time.
!                     Optimization at this level is more likely to
!                     improve performance if it is done with
!                     profile feedback.

 !! PARTICULAR Optimizations
! The levels specify sets of optimizations (like ordering the "Sport Package"
! for a new car).
! In contrast to optimization levels (-O3), the compiler can be told which
! particular optimizations to make.
! These are typically used by skilled programmers trying to get the
! fastest code.
! Some examples below.

! gcc
! `-frerun-loop-opt'
!      Run the loop optimizer twice.

! Sun Workshop 6
! Prefetch instructions read memory in advance, eliminating some cache
! misses (covered later in the semester).  They can also increase
! cache misses or increase the time to access memory, hurting
! performance.  The compiler does not know if they will help or hurt.
!      -xprefetch[=val],val
!           (SPARC) Enable prefetch instructions on those
!           architectures that support prefetch, such as UltraSPARC II.
!           (-xarch=v8plus, v9plusa, v9, or v9a)

!           Explicit prefetching should only be used under special
!           circumstances that are supported by measurements.

 !! ASSUMPTIONS (Assertions) About the Program

! Compilers must generate correct code.
!  That is, the code must execute in the way specified by the
!  high-level language definition.
! Correct code can be slow.
!  The compiled code might need to check for things that can happen ...
!  ... but don't in a particular program.
! Some options tell the compiler to make assumptions about the program.
!  These assumptions would not hold for every program.
!  The compiled program runs faster ...
!  ... and correctly if the assumptions are valid.

! Some switches specifying assumptions:

! gcc, Assume the program does not require strict IEEE 754 FP features.
! `-ffast-math'
!      This option allows GCC to violate some ANSI or IEEE rules and/or
!      specifications in the interest of optimizing code for speed.  For
!      example, it allows the compiler to assume arguments to the `sqrt'
!      function are non-negative numbers and that no floating-point values
!      are NaNs.
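
! A small sketch of code whose behavior depends on the assumption
! described above.  The function names are made up.  Under IEEE 754
! rules a NaN compares unequal to itself, so the comparison below
! detects NaNs.  With `-ffast-math' the compiler is allowed to assume
! that no value is a NaN, so it may replace the comparison with the
! constant 0 and the count would then always be zero.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if x is a NaN, relying on IEEE 754 comparison rules. */
int is_nan_by_compare(double x)
{
  return x != x;
}

/* Count the NaN elements in an array of n doubles. */
size_t count_nans(const double *a, size_t n)
{
  assert(a != NULL || n == 0);
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (is_nan_by_compare(a[i]))
      count++;
  return count;
}
```

! A program that never produces NaNs is unaffected by the assumption;
! a program that uses NaNs to mark missing data would be broken by it.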

! cc (Sun Workshop 6 Compiler) Assume certain pointers do not overlap.
!      -xrestrict=f
!           (SPARC) Treats pointer-valued function parameters as
!           restricted pointers. f is a comma-separated list that
!           consists of one or more function parameters, %all,
!           %none.  This command-line option can be used on its
!           own, but is best used with optimization of -xO3 or
!           greater.
!           The default is %none. Specifying -xrestrict is
!           equivalent to specifying -xrestrict=%all.
! :Example:
! In the loop below the compiler would ordinarily load the value at x
! from memory (dereference) each iteration (five times) because the
! address of x may be the same as the address of one of the "a"
! elements.  With the -xrestrict switch, x is not loaded each iteration,
! saving time and space.  The switch is needed because the compiler
! has no way of knowing if x and the a's overlap and must otherwise
! make a conservative assumption (that they overlap).
void array_add(int *a, int *b, int *x)
{
  int i;
  for(i=0; i<5; i++)
    a[i] = b[i] + x[0];
}

! Compiled Code Without -xrestrict 

!   24                !  int i;
!   25                !  for(i=0; i<5; i++)
!   26                !    a[i] = b[i] + *x;

                                ! %g4  i
                                ! %o2  x
                                ! %g3  &a[i]
                                ! %g2  &b[i]

/* 0x0004         26 */         ld      [%o1],%o5   ! o5 = Mem[o1]
/* 0x0008         23 */         or      %g0,%o1,%g2 ! g2 = 0 + o1
/* 0x000c         25 */         or      %g0,0,%g4   ! g4 = 0 + 0

                       .L900000108:                    ! The loop body starts here.
/* 0x0010         26 */         ld      [%o2],%o4   !  <-- Load x, notice that
/* 0x0014            */         add     %g4,1,%g4   !      o2 does not change.
/* 0x0018            */         add     %g2,4,%g2   ! g2 = g2 + 4
/* 0x001c            */         cmp     %g4,5
/* 0x0020            */         add     %o5,%o4,%o5 ! o5   = b[i] + x[0]
/* 0x0024            */         st      %o5,[%g3]   ! a[i] = o5
/* 0x0028            */         add     %g3,4,%g3  
/* 0x002c            */         bl,a    .L900000108
/* 0x0030            */         ld      [%g2],%o5   ! o5 = b[i+1]
/* 0x0034            */         retl    ! Result =
/* 0x0038            */         nop

! Compiled Code With -xrestrict 

/* 000000         23 */         or      %g0,%o0,%g4

!   24                !  int i;
!   25                !  for(i=0; i<5; i++)
!   26                !    a[i] = b[i] + *x;

                                                    ! o2  x
                                                    ! g2  *x
                                                    ! g4  &a[i]
                                                    ! g3  &b[i]

/* 0x0004         26 */         ld      [%o2],%g2   ! <-- Load x before entering
/* 0x0008         23 */         or      %g0,%o1,%g3 !     the loop.
/* 0x000c         26 */         ld      [%o1],%o5
/* 0x0010         25 */         or      %g0,0,%g1
                       .L900000108:                      ! The loop body starts here.
/* 0x0014         26 */         add     %o5,%g2,%o5
/* 0x0018            */         st      %o5,[%g4]
/* 0x001c            */         add     %g1,1,%g1
/* 0x0020            */         add     %g3,4,%g3
/* 0x0024            */         add     %g4,4,%g4
/* 0x0028            */         cmp     %g1,5
/* 0x002c            */         bl,a    .L900000108
/* 0x0030            */         ld      [%g3],%o5
/* 0x0034            */         retl    ! Result =
/* 0x0038            */         nop

! If array_add is compiled using the -xrestrict switch and it is
! called by the code below, it will produce the wrong result (since the
! value at x changes when a[1] is written).

int main(int argc, char **argv)
{
  int a[5], b[5];
  int *x = &a[1];
  ... (More code here)
}

! (End of example.)
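
! The same promise can be made in the source code with the C99
! restrict qualifier instead of a compiler switch.  The sketch below is
! an illustration, not the code used to produce the listings above;
! the name array_add_restrict is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Same loop as above.  The compiler must assume x may point into a[],
   so it reloads x[0] on every iteration. */
void array_add(int *a, const int *b, const int *x)
{
  assert(a != NULL && b != NULL && x != NULL);
  for (int i = 0; i < 5; i++)
    a[i] = b[i] + x[0];
}

/* The programmer promises that the storage written through a is not
   accessed through b or x, so the compiler may load x[0] once before
   the loop.  Calling this with x pointing into a[] is undefined
   behavior. */
void array_add_restrict(int *restrict a, const int *restrict b,
                        const int *restrict x)
{
  assert(a != NULL && b != NULL && x != NULL);
  for (int i = 0; i < 5; i++)
    a[i] = b[i] + x[0];
}
```

! When the arrays do not overlap, both versions give the same result.
! If array_add is called with x = &a[1], the value at x changes after
! the second iteration and the later elements use the new value; code
! that loaded x only once would use the old value instead.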


 !! PROFILING

! :Def: Profiling  (a.k.a. Feedback-Directed Optimization (FDO))
! A compilation technique in which data taken from a sample run is used
! to guide compiler decisions.

 !! Typical Profiling Procedure
! (1) Compile program with profiling option.
! (2) Run program using typical input data.
!     During the run, profile information is collected for the compiler.
! (3) Run compiler again, specifying where to find profile information.

 !! Unoptimized
 bneq r1, r2  IFPART

 sub r9, r10, r11

 add r3, r4, r5

 xor r6, r7, r8

 !! Optimized after profiling

 bneq r1, r2  IFPART
 sub r9, r10, r11

 xor r6, r7, r8
 !! Lots of additional code.

 !! Somewhere far away.
 add r3, r4, r5

! Branches occur frequently in code.  There is a performance penalty
! in taking a branch and so it's best if the compiler organizes code
! (rearranges the intermediate representation) so that branches are
! taken as rarely as possible.  To do that the compiler needs to
! know how often an "if" or other condition (for which the branch is
! emitted) is true.  Only in a few cases can the compiler figure that
! out on its own because, for example, "if" conditions depend on
! input data.  To obtain this information a two-step compilation
! process called profiling is used.  In the first step the code is
! compiled so that it records branch information (more precisely,
! basic block information, covered later).  The program is then run
! with typical input data, called the training input, and it writes
! the branch information to a file.  In the second step the compiler
! reads the information and uses it to better organize the code.
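
! The kind of data gathered in the first step can be pictured with
! hand-written instrumentation.  In the sketch below the structure,
! the counters, and the function names are all invented for
! illustration; a real compiler inserts equivalent counters
! automatically and writes them to a file when the program exits.

```c
#include <assert.h>
#include <stddef.h>

/* Counters for one "if" condition. */
struct branch_profile {
  unsigned long cond_true;   /* times the condition was true */
  unsigned long cond_false;  /* times the condition was false */
};

/* Example routine with one data-dependent "if", instrumented by hand. */
int clamp_to_zero(int v, struct branch_profile *p)
{
  assert(p != NULL);
  if (v < 0) {
    p->cond_true++;
    return 0;
  }
  p->cond_false++;
  return v;
}

/* Fraction of executions in which the condition was true
   (0.0 if the code was never executed). */
double true_fraction(const struct branch_profile *p)
{
  assert(p != NULL);
  unsigned long total = p->cond_true + p->cond_false;
  return total ? (double) p->cond_true / (double) total : 0.0;
}
```

! If a training run finds the condition true only rarely, the compiler
! can place the "if" body far away so that the common path falls
! through without a taken branch.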

 !! Sun Profiling Compiler Switches
!      -xprofile=p
!           Collects data for a profile or uses a profile to
!           optimize.
!           p must be collect[:name], use[:name], or tcov.
!           This option causes execution frequency data to be
!           collected and saved during execution, then the data can be
!           used in subsequent runs to improve performance. This
!           option is only valid when a level of optimization is
!           specified.

 !! GCC Profiling Compiler Switches
! `-fprofile-generate'
! `-fprofile-generate=PATH'
!      Enable options usually used for instrumenting application to
!      produce profile useful for later recompilation with profile
!      feedback based optimization.  You must use `-fprofile-generate'
!      both when compiling and when linking your program.
!      The following options are enabled: `-fprofile-arcs',
!      `-fprofile-values', `-fvpt'.
!      If PATH is specified, GCC will look at the PATH to find the
!      profile feedback data files. See `-fprofile-dir'.
! `-fprofile-use'
! `-fprofile-use=PATH'
!      Enable profile feedback directed optimizations, and optimizations
!      generally profitable only with profile feedback available.
!      The following options are enabled: `-fbranch-probabilities',
!      `-fvpt', `-funroll-loops', `-fpeel-loops', `-ftracer'
!      By default, GCC emits an error message if the feedback profiles do
!      not match the source code.  This error can be turned into a
!      warning by using `-Wcoverage-mismatch'.  Note this may result in
!      poorly optimized code.
!      If PATH is specified, GCC will look at the PATH to find the
!      profile feedback data files. See `-fprofile-dir'.
! `-fbranch-probabilities'
!      After running a program compiled with `-fprofile-arcs' (*note
!      Options for Debugging Your Program or `gcc': Debugging Options.),
!      you can compile it a second time using `-fbranch-probabilities',
!      to improve optimizations based on the number of times each branch
!      was taken.  When the program compiled with `-fprofile-arcs' exits
!      it saves arc execution counts to a file called `SOURCENAME.gcda'
!      for each source file.  The information in this data file is very
!      dependent on the structure of the generated code, so you must use
!      the same source code and the same optimization options for both
!      compilations.
!      With `-fbranch-probabilities', GCC puts a `REG_BR_PROB' note on
!      each `JUMP_INSN' and `CALL_INSN'.  These can be used to improve
!      optimization.  Currently, they are only used in one place: in
!      `reorg.c', instead of guessing which path a branch is most likely
!      to take, the `REG_BR_PROB' values are used to exactly determine
!      which path is taken more often.
! `-fprofile-values'
!      If combined with `-fprofile-arcs', it adds code so that some data
!      about values of expressions in the program is gathered.
!      With `-fbranch-probabilities', it reads back the data gathered
!      from profiling values of expressions and adds `REG_VALUE_PROFILE'
!      notes to instructions for their later usage in optimizations.
!      Enabled with `-fprofile-generate' and `-fprofile-use'.
! `-fvpt'
!      If combined with `-fprofile-arcs', it instructs the compiler to add
!      code to gather information about values of expressions.
!      With `-fbranch-probabilities', it reads back the data gathered and
!      actually performs the optimizations based on them.  Currently the
!      optimizations include specialization of division operation using
!      the knowledge about the value of the denominator.

!! Use of Compiler Switches

! ISA
!   Type of system all users have. (IA-32 for PCs, SPARC for Sun users, etc.)
!   Users can't normally run code compiled for a different ISA.

! Implementation
!   Type of system most users have.
!   Other users can run the code, but it won't run as fast.

! Optimization
!   Select medium or high optimization level.
!   If the market is very sensitive to performance, use specific optimizations.