lopt.s

!! LSU EE 4720 -- Spring 2025 -- Computer Architecture
!
!! Compiler Optimization Lecture Notes

!! Integrate this ref: CACM Feb 2020 p 41. Opt in C++

!! Contents
!
! Optimization Introduction
! Steps in Building a Program
! Compiler Terminology
! High-Level Optimizations
! Low-Level Optimizations
! Compiler Optimization Options
! Profiling
! Use of Compiler Switches

!! References
!
! :HP3: Hennessy & Patterson, "Computer Architecture, a Quantitative Approach"

!! Lecture Goals
!
! Understand the Program Building and Compilation Process
!  Describe steps in program building and optimization, including
!  intermediate files (assembler, object, ...) and tool names (preprocessor,
!  compiler, etc.).
!
! Understand Specific Optimizations and Assumption Switches
!  Describe, work example, explain benefit.
!
! Understand Profiling
!  Steps, how performance is improved.
!
! Understand ISA and Implementation Options
!  How programmer chooses them, how compiler uses them.
!
! Understand how Programmers Use Compilation Options
!  Normal program development, high-performance programs, SPEC disclosure.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Optimization Introduction

! :HP3: Section 2.11

! :Def: Optimization
! The optional steps in compiling a program that reduce the program's
! execution time, code size, energy consumption, etc.

! Typically, the only time optimization is NOT done is when a program
! is being debugged.
!
! In most cases, the programmer sets overall optimization effort (low,
! medium, high).
!
! When performance is very important the programmer can specify which
! specific optimizations to use.

! Commands (Linux)
! Laptop: taskset 0xff perf stat ./pi
! taskset 0xff perf stat --log-fd 1 -e instructions ./pi  | grep instr

! :Example:
!
! Data on a program that computes π with and without optimization.
!
! OS: Linux 6.8.5-301.fc40.x86_64
! Hardware: 13th Gen Intel(R) Core(TM) i9-13950HX
! Compiler: gcc version 14.0.1 20240411 (Red Hat 14.0.1-0) (GCC) 
! 
!
! Without Optimization
!   Clock             : 5.171 GHz
!   Instruction Count : 575,160,252
!   Run Time          : 79.21 ms
!   Insn Throughput   : 7.261 G insn/s, 1.404 insn/cyc
!
! With Optimization
!   Clock             : 5.309 GHz
!   Instruction Count : 250,160,137
!   Run Time          : Let's predict.

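!   Predicted Run Time (scale by instruction count and clock ratios):
!   (* 79.21 (/ 250160137 575160252.0) (/ 5.171 5.309));  = 33.5561 ms
!   Measured:
!   Run Time          : 37.18 ms
!   Insn Throughput   : (/ 250160137.0 (* 5.309 37.18 .001 1000000000.0));  = 1.2673 insn/cyc
!   Slower than predicted because insn/cyc fell from 1.404 to 1.267.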

! Commands:
!   gcc pi.c -o pi-noopt -O0
!   gcc pi.c -o pi -O3
!   taskset 0xff perf stat ./pi 
!   taskset 0xff perf stat ./pi-noopt
! Programs:
!   gcc: GNU C Compiler (more correct: GNU Compiler Collection)
!   taskset: Utility to run code on certain cores. Fast ones in this case.
!   perf: Utility to collect performance data.
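
! The prediction above can be written as a small C function. This is a
! minimal sketch, not part of the original measurements: the function
! name and parameter names are made up for illustration, and it assumes
! that the instructions executed per cycle do not change between builds.

```c
/* Hypothetical helper: predict the run time of a new build from a
   measured baseline, assuming instructions per cycle stay the same. */
double predict_run_time_ms(double base_ms,
                           double base_insn, double new_insn,
                           double base_ghz, double new_ghz)
{
  /* Run time scales with instruction count and inversely with clock. */
  return base_ms * (new_insn / base_insn) * (base_ghz / new_ghz);
}
```

! With the values from the example (79.21 ms, 575160252 and 250160137
! instructions, 5.171 and 5.309 GHz) this gives about 33.56 ms.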

! :Example:
!
! Data on a program that computes π with and without optimization.
!
! System:   SunOS sol 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-Enterprise
! Compiler: Sun WorkShop 6 2000/04/07 C 5.1
! Clock Frequency: 300 MHz
!
! Without Optimization
!   Size              : 6408 bytes
!   Instruction Count : 325,338,749
!   Run Time          : 3.00 s
!   CPI               : (/ (* 3.00 300e6) 325338749.0 ) = 2.7663
!   IPC               : (/ 325338749.0 (* 3.00 300e6) ) = 0.3615 
!

! With Optimization
!   Size              : 6340 bytes    Smaller! (OK, only a tiny bit smaller.)
!   Instruction Count : 100,338,751   Just 1/3 the instructions!!

!   Run Time: Let's predict this.


!   Run Time          : 1.65 s        About half the time. That's it?
!   CPI               : (/ (* 1.65 300e6) 100338751.0) = 4.9333
!   IPC               : (/ 100338751.0 (* 1.65 300e6)) = 0.2027 
!
! Comparison of Un-optimized and Optimized Runs
!
!  Un-optimized run takes 1.82 times longer.
!  Un-optimized run executes 3.24 times more instructions.
!
!  --> Execution time is not proportional to instruction count
!      We've already seen how this can be true due to stalls.
!
!  --> In the slower version ..
!      .. instructions seem to be executed more efficiently [sic].
!
!  Quick Explanation (See insn scheduling discussion below for code.)
!
!    The un-optimized version contained more easy instructions, such
!      as load and store instructions (that would hit the cache).
!
!    Both versions had the same number of floating-point divide instructions
!      which take a long time to execute.
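
! The CPI arithmetic in this example can be sketched in C as follows.
! The function names are hypothetical; the formulas are the usual
! relation: run time = instruction count * CPI / clock frequency.

```c
/* Hypothetical helper: cycles per instruction from measured values. */
double cpi(double seconds, double clock_hz, double insn_count)
{
  return seconds * clock_hz / insn_count;
}

/* Run time implied by an instruction count and a given CPI. */
double run_time_from_cpi(double insn_count, double cycles_per_insn,
                         double clock_hz)
{
  return insn_count * cycles_per_insn / clock_hz;
}
```

! Applying the un-optimized CPI (2.7663) to the optimized instruction
! count (100,338,751) predicts a run time of about 0.93 s. The measured
! 1.65 s corresponds to the higher CPI of 4.9333.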


 !! Reasons *not* to Optimize
!
! Percentage shows roughly how often reason applies.
!
! 95%  It makes debugging difficult, so don't optimize while debugging.
!      This is true for almost everyone who uses a debugger.
!
! 10%  It slows down compilation.
!      Only important when there is a very large amount of code to
!      recompile.  (Back in the 20th century when computers were slow
!      this was important.)
!
!  .001%  Optimization introduces bugs.
!      It does, but very rarely.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Steps in Building a Program

! Typical Steps in Building a Program
!
!      Pre-Process, Compile, Assemble, Link, Load
! pi.c ->     pi.i  -> pi.s  ->  pi.o  -> pi
!
! These steps can all be performed automatically by the compiler
! driver. (cc, gcc, MS Visual Studio, etc.)
!
! :Sample: cc pi.c -o pi

!
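! They can also be specified individually.
!
! :Sample: gcc -E pi.c -o pi.i   (pre-process only)
!          gcc -S pi.i -o pi.s   (compile only)
!          gcc -c pi.s -o pi.o   (assemble only)
!          gcc pi.o -o pi        (link)
!
! More details appear below; the compile step is the one we're interested in.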

 !! Pre-Process
!   Load include files, expand macros.
!   Typical pre-processor name: cpp
!   Input File: pi.c
!   Output File: pi.i  (High-Level Language)

 !! Compile
!   Convert high-level language into assembly language.
!   Typical compiler name: cc1 (not cc, or gcc, or Visual Studio)
!   Input File: pi.i
!   Output File: pi.s  (Assembly language)

 !! Assemble
!   Convert assembly language into machine language.
!   Typical assembler name: as
!   Input File: pi.s
!   Output File: pi.o  (Object file containing machine code.)

 !! Link
!   Combine object files, libraries, and other code into executable.
!   Typical linker name: ld
!   Input File: pi.o
!   Output File: pi

 !! Load
!   Copy executable into memory, link with shared libraries, and start.
!   Loader name: exec system call.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Terminology

! A program is compiled using a series of steps called /passes/.
!
!   The first pass reads the pre-processed high-level source code.
!
!   The last pass emits assembler code.
!
!   Between the first and last passes the program is in some
!    intermediate representation.
!
!   The way passes are defined and organized varies greatly from
!   compiler to compiler.

! :Def: Pass
! A step in compiling a program.  A pass usually looks at an entire function.

! :Def: Intermediate Representation
! The form of a program (sort of a special-purpose language) used internally
! by a compiler.  A compiler might use several intermediate representations.

 !! Typical Passes
!
! Parse
!   Convert the source code to a high-level intermediate representation (H-IR).
!
! High-Level Optimization 
!   Optional, may be done in several passes.
!   Also called front-end optimization.
!   Modify H-IR to improve performance or reduce code size.
!   Reads and writes H-IR
!
! Low-Level Intermediate Representation (L-IR) Generation
!   Convert H-IR to a low-level intermediate representation (L-IR).
!
! Low-Level Optimization  (Optional, may be done in several passes.)
!   Also called back-end optimization.
!   Modify L-IR to improve performance or reduce code size.
!
! Register Assignment (Part of low-level optimization.)
!   Choose machine registers.
!
! Code Generation
!   Convert L-IR to assembly code.
!
! Peephole Optimizations  (Optional, may be done in several passes.)
!   These are also called low-level optimizations.
!   Modify L-IR to improve performance or reduce code size.
!   Some of these can be done at link time.

! :Def: Compiler Front End
! The parts of the compiler that do the parsing and high-level
! optimization passes.  Computer architects are less interested
! in this part.

! :Def: Compiler Back End
! The parts of the compiler that do the low-level optimization,
! register assignment, and code generation passes. Computer
! architects are very interested in this part.


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! High-Level Optimizations

! Easy high-level optimizations presented here.

 !! Some Easy-To-Explain Front-End Optimizations
!
! Dead-Code Elimination (DCE)
! Common Subexpression Elimination (CSE)
! Constant Propagation, Folding

! :Def: Dead-Code Elimination (DCE)
!   Removal of code whose results are never used.
!   Yes, it happens.
!   This can also be a low-level optimization.

! :Example:

  ! Before
  x = a + b;
  x = c + d;

  ! After
  x = c + d;


! :Example:
!
! Code benefiting from DCE
! High-level code shown for clarity.  Most compilers will transform
! an intermediate representation.
!
! Before:
!
int
main(int argc, char **argv)
{
  double i;   double sum = 0;

  for(i=1; i<50000000;)
    {
      sum = sum + 4.0 / i;   i += 2;
      sum = sum - 4.0 / i;   i += 2;
    }

  printf("Performed %.0f iterations.  Thank you for running me.\n",i);
}
!
! After:
!
int
main(int argc, char **argv)
{
  double i;

  for(i=1; i<50000000;)
    {
      i += 2;
      i += 2;
    }

  printf("Performed %.0f iterations.  Thank you for running me.\n",i);
}
!
! Note: Other optimizations would leave only the printf.

! :Def: Common Subexpression Elimination (CSE)
!   Compute a repeated expression once and reuse the result.
!
! Before:
  r = ( a + b ) / ( x + y );
  s = ( a + b ) / ( x - y );
! After:
  temp = a + b;
  r = ( temp ) / ( x + y );
  s = ( temp ) / ( x - y );

! :Def: Constant Propagation, Folding
!   The compiler performs whatever arithmetic it can at compile time
!   rather than emitting code to perform the arithmetic at run time.
!
! Before:
  int sec_per_day = 60 * 60 * 24;
  int sec_per_year = sec_per_day * 365;
  some_routine(sec_per_day * x, sec_per_year * y);
! After:
  int sec_per_day = 86400;
  int sec_per_year = 31536000;
  some_routine(86400 * x, 31536000 * y);


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Low-Level Optimizations

! Some Low-Level Optimizations
!
! This is not a complete list.
!
! Register Assignment
! Instruction Selection
! Scheduling

! :Def: Register Assignment
! Selection of which values will be held in registers.  Values
! not held in registers are stored in memory.
!
! Without Register Assignment Optimizations
! All values corresponding to variables (in high-level program) are
! written to memory (and not held in registers).  Intermediate results are
! held in registers.
!
! With Register Assignment Optimization
! Registers are assigned to as many variables as possible, with priority
! given to frequently used variables.
!
! Advantage of Register Assignment Optimization
! Fewer memory writes and reads.
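
! The difference can be illustrated at the source level. The sketch
! below is not compiler output. The function names are made up, and
! volatile is used in the first function only to force the
! load-and-store-every-iteration behavior that un-optimized code
! typically shows. Whether a real compiler keeps "total" in a register
! in the second function depends on the compiler and its options.

```c
/* Behaves like code WITHOUT register assignment optimization:
   total lives in memory and is read and written each iteration. */
int sum_in_memory(const int *v, int n)
{
  int i;
  volatile int total = 0;     /* volatile forces memory accesses */
  for (i = 0; i < n; i++)
    total = total + v[i];     /* load total, add, store total */
  return total;
}

/* Behaves like code WITH register assignment optimization:
   total is an ordinary local that a compiler can hold in a register,
   so only v[i] needs to be read from memory in the loop. */
int sum_in_register(const int *v, int n)
{
  int i;
  int total = 0;
  for (i = 0; i < n; i++)
    total = total + v[i];
  return total;
}
```

! Both functions return the same value; they differ only in the number
! of memory reads and writes performed in the loop.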


 !! Instruction Selection



! :Def: Scheduling
! Re-arranging instructions to minimize the amount of time one instruction
! has to wait for another.
!
! For example, if an instruction takes a long time it will be started early
! so that other instructions will not have to wait for its result.
!
! Scheduling will be covered in more detail later in the semester.

! :Example:
!
! pi program without and with optimization.
!
! Optimizations include register assignment and scheduling.
!
! Without Optimization
!
!   10      {
!   11        sum = sum + 4.0 / i;   i += 2;

        ldd     [%fp-24],%f6   ! f6 = sum
        ldd     [%l0+0],%f4    ! f4 = 4.0
        ldd     [%fp-16],%f2   ! f2 = i
        fdivd   %f4,%f2,%f2    ! f2 = 4.0 / i
        faddd   %f6,%f2,%f2    ! f2 = sum + (4.0/i)
        std     %f2,[%fp-24]   ! sum = f2

        ldd     [%fp-16],%f4   ! f4 = i
        ldd     [%l0+8],%f2    ! f2 = 2.0
        faddd   %f4,%f2,%f2    ! f2 = i + 2.0
        std     %f2,[%fp-16]   ! i = f2

!   12        sum = sum - 4.0 / i;   i += 2;

        ldd     [%fp-24],%f6   ! f6 = sum
        ldd     [%l0+0],%f4    ! f4 = 4.0
        ldd     [%fp-16],%f2   ! f2 = i
        fdivd   %f4,%f2,%f2    ! f2 = 4.0 / i
        fsubd   %f6,%f2,%f2    ! f2 = sum - (4.0/i)
        std     %f2,[%fp-24]   ! sum = f2

        ldd     [%fp-16],%f4   ! f4 = i
        ldd     [%l0+8],%f2    ! f2 = 2.0
        faddd   %f4,%f2,%f2    ! f2 = i + 2.0
        std     %f2,[%fp-16]   ! i = f2

        ldd     [%fp-16],%f4   ! f4 = i
        ldd     [%l0-8],%f2    ! f2 = 50000000
        fcmped  %f4,%f2        ! compare i, 50000000
        nop
        fbl     .L92           ! Branch if FP comparison less than.
        nop

!
! With Optimization
!
!   10                !    {
!   11                !      sum = sum + 4.0 / i;   i += 2;
!
                       .L77000016:
/* 0x0020         11 */         fdivd   %f4,%f30,%f6    ! temp1 = 4.0 / i_2
/* 0x0024            */         faddd   %f30,%f2,%f8    ! i_1 = i_2 + 2.0

!   12                !      sum = sum - 4.0 / i;   i += 2;

/* 0x0028         12 */         faddd   %f8,%f2,%f30    ! i_2 = i_1 + 2.0
/* 0x002c            */         fcmped  %f30,%f0        ! i_2 < 50000000
/* 0x0030            */         fdivd   %f4,%f8,%f8     ! temp2 = 4.0 / i_1
/* 0x0034         11 */         faddd   %f10,%f6,%f6    ! sum = sum + temp1
/* 0x0038         12 */         fbl     .L77000016
/* 0x003c            */         fsubd   %f6,%f8,%f10    ! sum = sum - temp2

!   13                !    }
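
! The optimized loop above can be read back into C to make the
! reordering easier to see. This is a hand-written sketch based on the
! comments in the assembly code (temp1, temp2, i_1, i_2), not compiler
! output. The function names and the "limit" parameter are made up
! for illustration, and limit is assumed to be greater than 1.

```c
/* The loop as written in the source program. */
double pi_loop_original(double limit)
{
  double i, sum = 0;
  for (i = 1; i < limit;)
    {
      sum = sum + 4.0 / i;   i += 2;
      sum = sum - 4.0 / i;   i += 2;
    }
  return sum;
}

/* The same computation in the order used by the optimized code:
   both divides are started before their results are needed, and
   sum, i_1, and i_2 stay in registers for the whole loop. */
double pi_loop_scheduled(double limit)
{
  double sum = 0, i_2 = 1;
  do
    {
      double temp1 = 4.0 / i_2;   /* first divide starts early */
      double i_1 = i_2 + 2.0;
      double temp2;
      i_2 = i_1 + 2.0;
      temp2 = 4.0 / i_1;          /* second divide */
      sum = sum + temp1;
      sum = sum - temp2;
    }
  while (i_2 < limit);
  return sum;
}
```

! Both functions perform the same floating-point operations on sum in
! the same order, so they compute the same value.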



!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Optimization Options

! Compiler options tell the compiler:
!
!   How much EFFORT to put into optimization.  E.g., -O2
!   Which PARTICULAR optimizations to perform.  E.g., -fstrength-reduce
!   The TARGET system the code will be running on.  E.g., -xultra2
!   Whether to make certain ASSUMPTIONS about the code.
!   Whether to use PROFILING data from a training run.


 !! Target Options
!
! Specify the exact type of machine the code will run on: ISA, Implementation, Cache
!
! Choice is based on type of machines customers have.

!
!
 !! Specifying the: ISA
!
! The exact instruction set used.
! Specifies not just the ISA family, but a particular variation.
! A poor choice will limit the number of machines code can run on.
!
 !! Specifying the: Implementation (Processor Core)
!
! Specify the implementation code will run on.
! A poor choice will result in slower code.
!

 !! Background
!
! ARM AArch64 Architecture
!
!  Developed by ARM.
!
! Some Implementations
!
!  cortex-a53
!  cortex-a72
!  exynos-m1

! :Example:
!
! Switches for GCC
!
!     -march     Specifies ISA
!     -mtune     Specifies the implementation.
!
!     -mcpu      Specifies both ISA and implementation

! From the GCC documentation for '-mtune=NAME' (AArch64):
!
! Specify the name of the target processor for which GCC should tune
! the performance of the code.  Permissible values for this option
! are: 'generic', 'cortex-a53', 'cortex-a57', 'cortex-a72',
! 'exynos-m1', 'thunderx', 'xgene1'.



 !! Optimization EFFORT
!
! :Def: Optimization Effort
! Amount of optimization to be done. A small effort means performing
! only easy optimizations, a large effort means performing more
! time-consuming optimizations.
!
! Most compilers have optimization levels.
!
! The higher the number, the more optimizations done.

! :Example:
!
! Optimization Levels for gcc Version 9:

!   '-O'
!   '-O1'
!        Optimize.  Optimizing compilation takes somewhat more time, and a
!        lot more memory for a large function.
!   
!        With '-O', the compiler tries to reduce code size and execution
!        time, without performing any optimizations that take a great deal
!        of compilation time.
!   
!   '-O2'
!        Optimize even more.  GCC performs nearly all supported
!        optimizations that do not involve a space-speed tradeoff.  As
!        compared to '-O', this option increases both compilation time and
!        the performance of the generated code.
!   
!   
!   '-O3'
!        Optimize yet more.  '-O3' turns on all optimizations specified by
!        '-O2' and also turns on the
!        [snip]
!   
!   '-O0'
!        Reduce compilation time and make debugging produce the expected
!        results.  This is the default.
!   
!   '-Os'
!        Optimize for size.  '-Os' enables all '-O2' optimizations that do
!        not typically increase code size.  It also performs further
!        optimizations designed to reduce code size.
!   
!   '-Ofast'
!        Disregard strict standards compliance.  '-Ofast' enables all '-O3'
!        optimizations.  It also enables optimizations that are not valid
!        for all standard-compliant programs.  It turns on '-ffast-math' and
!        the Fortran-specific '-fno-protect-parens' and '-fstack-arrays'.
!   
!   '-Og'
!        Optimize debugging experience.  '-Og' enables optimizations that do
!        not interfere with debugging.  It should be the optimization level
!        of choice for the standard edit-compile-debug cycle, offering a
!        reasonable level of optimization while maintaining fast compilation
!        and a good debugging experience.


 !! PARTICULAR Optimizations
!
! The levels specify sets of optimizations (like ordering the "Sport Package"
! for a new car).
!
! In contrast to optimization levels (-O3), the compiler can be told which
! particular optimizations to make.
!
! These are typically used by skilled programmers trying to get
! fastest code.
!
! Some examples below.


! gcc
!   
!   '-fschedule-insns'
!        If supported for the target machine, attempt to reorder
!        instructions to eliminate execution stalls due to required data
!        being unavailable.  This helps machines that have slow floating
!        point or memory load instructions by allowing other instructions to
!        be issued until the result of the load or floating-point
!        instruction is required.
!   
!
!   `-frerun-loop-opt'
!        Run the loop optimizer twice.
!   
!   '-fprefetch-loop-arrays'
!        If supported by the target machine, generate instructions to
!        prefetch memory to improve the performance of loops that access
!        large arrays.
!   
!        This option may generate better or worse code; results are highly
!        dependent on the structure of loops within the source code.




 !! ASSUMPTIONS (Assertions) About the Program

! Compilers must generate correct code.
!
!  That is, the code must execute in the way specified by the
!  high-level language definition.
!
! Correct code can be slow.
!
!  The compiled code might need to check for things that can happen ...
!  ... but don't in a particular program.
!
! Some options tell the compiler to make assumptions about the program.
!
!  These assumptions would not hold for every program.
!
!  The compiled program runs faster ...
!  ... and correctly if the assumptions are valid.

! Some switches specifying assumptions:

! gcc, Assume the program does not require strict IEEE 754 FP features.
! `-ffast-math'
!      This option allows GCC to violate some ANSI or IEEE rules and/or
!      specifications in the interest of optimizing code for speed.  For
!      example, it allows the compiler to assume arguments to the `sqrt'
!      function are non-negative numbers and that no floating-point values
!      are NaNs.
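
! One reason strict floating-point rules limit optimization is that
! floating-point addition is not associative, so the compiler may not
! reorder additions unless told that it is acceptable. The sketch
! below (function names made up for illustration) shows two orderings
! of the same sum that give different results.

```c
/* The sum evaluated left to right, as written in the source. */
double sum_left(double a, double b, double c)
{
  return (a + b) + c;
}

/* The same sum with the additions re-associated. */
double sum_right(double a, double b, double c)
{
  return a + (b + c);
}
```

! With a = 1e20, b = -1e20, c = 1.0 the first function returns 1.0 and
! the second returns 0.0, because 1.0 is lost when added to -1e20.
! A switch such as -ffast-math tells the compiler that differences
! like this are acceptable to the programmer.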


! cc (Sun Workshop 6 Compiler) Assume certain pointers do not overlap.
!
!      -xrestrict=f
!           (SPARC) Treats pointer-valued function parameters as
!           restricted pointers. f is a comma-separated list that
!           consists of one or more function parameters, %all,
!           %none.  This command-line option can be used on its
!           own, but is best used with optimization of -xO3 or
!           greater.
!
!           The default is %none. Specifying -xrestrict is
!           equivalent to specifying -xrestrict=%all.
!
! :Example:
!
! In the loop below the compiler would ordinarily load the value at x
! from memory (dereference) each iteration (five times) because the
! address of x may be the same as the address of one of the "a"
! elements.  With the -xrestrict switch, x is loaded once before the
! loop rather than each iteration, saving time and space.  The switch
! is needed because the compiler has no way of knowing whether x and
! the a's overlap and must otherwise make a conservative assumption
! (that they overlap).
void array_add(int *a, int *b, int *x)
{
  ! Suppose: &a[0] = 0x1000,  &x[0] = 0x2000  // Normal.
  !          &a[1] = 0x1004,  
  !          &a[2] = 0x1008, ..

  ! Suppose: &a[0] = 0x1000,  &x[0] = 0x1004  // Huh?!?
  !          &a[1] = 0x1004,  
  !          &a[2] = 0x1008, ..

  int i;
  for(i=0; i<5; i++) 
    a[i] = b[i] + x[0];

  ! Plan B: The programmer loads x[0] once, by hand, before the loop.
  int xzero = x[0];
  for(i=0; i<5; i++) 
    a[i] = b[i] + xzero;


}

! Compiled Code Without -xrestrict 

!   24                !  int i;
!   25                !  for(i=0; i<5; i++)
!   26                !    a[i] = b[i] + x[0];

                                ! %g4  i
                                ! %o2  x
                                ! %g3  &a[i]
                                ! %g2  &b[i]


/* 0x0004         26 */         ld      [%o1],%o5   ! o5 = Mem[o1]
/* 0x0008         23 */         or      %g0,%o1,%g2 ! g2 = 0 + o1
/* 0x000c         25 */         or      %g0,0,%g4   ! g4 = 0 + 0

                       .L900000108:                    ! The loop body starts here.
/* 0x0010         26 */         ld      [%o2],%o4   !  <-- Load x, notice that
/* 0x0014            */         add     %g4,1,%g4   !      o2 does not change.
/* 0x0018            */         add     %g2,4,%g2   ! g2 = g2 + 4
/* 0x001c            */         cmp     %g4,5
/* 0x0020            */         add     %o5,%o4,%o5 ! o5   = b[i] + x[0]
/* 0x0024            */         st      %o5,[%g3]   ! a[i] = o5
/* 0x0028            */         add     %g3,4,%g3  
/* 0x002c            */         bl,a    .L900000108
/* 0x0030            */         ld      [%g2],%o5   ! o5 = b[i+1]
                       .L77000022:
/* 0x0034            */         retl    ! Result =
/* 0x0038            */         nop


! Compiled Code With -xrestrict 

/* 000000         23 */         or      %g0,%o0,%g4

!   24                !  int i;
!   25                !  for(i=0; i<5; i++)
!   26                !    a[i] = b[i] + *x;

                                                    ! o2  x
                                                    ! g2  *x
                                                    ! g4  &a[i]
                                                    ! g3  &b[i]


/* 0x0004         26 */         ld      [%o2],%g2   ! <-- Load x before entering
/* 0x0008         23 */         or      %g0,%o1,%g3 !     the loop.
/* 0x000c         26 */         ld      [%o1],%o5
/* 0x0010         25 */         or      %g0,0,%g1
                       .L900000108:                      ! The loop body starts here.
/* 0x0014         26 */         add     %o5,%g2,%o5
/* 0x0018            */         st      %o5,[%g4]
/* 0x001c            */         add     %g1,1,%g1
/* 0x0020            */         add     %g3,4,%g3
/* 0x0024            */         add     %g4,4,%g4
/* 0x0028            */         cmp     %g1,5
/* 0x002c            */         bl,a    .L900000108
/* 0x0030            */         ld      [%g3],%o5
                       .L77000022:
/* 0x0034            */         retl    ! Result =
/* 0x0038            */         nop


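! If array_add is compiled using the -xrestrict switch and it is
! called by the code below, it will produce the wrong result: x points
! to a[1], so x[0] changes when a[1] is written, but the compiled code
! keeps using the value loaded before the loop.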

int main(int argc, char **argv)
{
  int a[5], b[5];
  int *x = &a[1];
  ... (More code here)
  array_add(a,b,x);
}

! (End of example.)
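
! The effect of the overlap can be reproduced in plain C. This is a
! minimal sketch: array_add_hoisted is a made-up name for a
! hand-written version that loads x[0] once before the loop, which is
! what the code compiled with -xrestrict does.

```c
/* As written: x[0] is read from memory on every iteration. */
void array_add(int *a, int *b, int *x)
{
  int i;
  for (i = 0; i < 5; i++)
    a[i] = b[i] + x[0];
}

/* Hand-hoisted version: x[0] is read once, before the loop.
   Gives the same result as array_add only if x does not point
   into a. */
void array_add_hoisted(int *a, int *b, int *x)
{
  int i;
  int xzero = x[0];
  for (i = 0; i < 5; i++)
    a[i] = b[i] + xzero;
}
```

! Suppose b = {1,2,3,4,5}, a = {0,10,0,0,0}, and x = &a[1].
! array_add leaves a = {11,12,15,16,17} because x[0] becomes 12 when
! a[1] is written. array_add_hoisted leaves a = {11,12,13,14,15}.
! In C99 and later the no-overlap promise can also be made for
! individual parameters using the restrict qualifier.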


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Profiling
!!

! :Def: Profiling  (a.k.a. Feedback-Directed Optimization (FDO))
! The techniques used to optimize a program using statistics collected
! from a training run of the program.

 !! Typical Profiling Procedure
!
! (1) Compile program with profiling option.
!
! (2) Run program using typical input data.
!     This run is called the training run and the input data is
!     called the training data.
!     During the training run statistics are collected.
!
! (3) Run compiler again, specifying where to find the statistics.


! :Example:
!
! Optimization of a simple if/else statement:
!
!   if ( r1 != r2 )  // Mostly r1 equals r2.
!     { IFPART:   r3 = r4 + r5; } 
!   else
!     { ELSEPART: r9 = r10 - r11; }  // More likely path
!
!   CONTINUE: r6 = r7 ^ r8;
!
! On the target there is a large branch penalty ..
! .. and so the optimizer should try to ..
! .. minimize the number of *taken* branches.

 !!
 !! Unoptimized Code
 !!
 bne r1, r2  IFPART
 nop

ELSEPART:
 j CONTINUE;
 sub r9, r10, r11

IFPART:
 add r3, r4, r5

CONTINUE:
 xor r6, r7, r8

! When r1==r2: 4 insn, 1 taken control transfer. More likely
! When r1!=r2: 3 insn, 1 taken control transfer. 

!
! In the unoptimized code above ..
! .. there is one taken control transfer (bne or j) ..
! .. whether or not r1 == r2.

 
 !!
 !! Optimized Using Profile Data
 !!
 !  Compiler now "knows" that r1==r2 is the common case.

 bne r1, r2  IFPART
 nop
ELSEPART:
 sub r9, r10, r11

CONTINUE:
 xor r6, r7, r8
 
 !! Lots of additional code.


 !! Somewhere far away.
IFPART:
 add r3, r4, r5
 j CONTINUE;
 nop
!
! In the profile-optimized code above ..
! .. for the common case, r1 == r2, there are zero taken control transfers ..
! .. and for the uncommon case, there are two (bne and j).
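
! Whether the profile-optimized layout is better depends on how often
! r1 == r2. The sketch below computes the expected number of taken
! control transfers per execution of the if/else for each layout.
! The function names and the probability parameter p_equal are made up
! for illustration.

```c
/* p_equal: probability that r1 == r2 (the else part executes). */

/* Unoptimized layout: the equal case takes the j and the unequal
   case takes the bne, so one transfer is taken either way. */
double taken_unoptimized(double p_equal)
{
  return p_equal * 1.0 + (1.0 - p_equal) * 1.0;
}

/* Profile-optimized layout: the equal case takes none, the unequal
   case takes two (bne and j). */
double taken_profile_optimized(double p_equal)
{
  return p_equal * 0.0 + (1.0 - p_equal) * 2.0;
}
```

! For p_equal = 0.9 the expected counts are about 1.0 and 0.2. The two
! layouts break even at p_equal = 0.5, so the reorganized layout helps
! only when r1 == r2 is the more common case.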



! Branches occur frequently in code. There is a performance penalty in
! taking a branch and so it's best if the compiler organizes code
! (rearranges the intermediate representation) so that branches are
! taken as rarely as possible. To do that the compiler needs to know
! how often an "if" or other condition (for which the branch is
! emitted) is true. Only in a few cases can the compiler figure that
! out on its own because, for example, many "if" conditions depend on
! input data. To obtain this useful information a two-step compilation
! process called profiling is used. In the first step the code is
! compiled so that it records branch information (more precisely, basic
! block information, covered later). The program is then run with
! typical input data, called the training input, and it writes the
! branch information to a file. In the second step the compiler reads
! the information and uses it to better organize the code.
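
! As an illustration, the function below contains an "if" whose
! outcome depends entirely on input data. This is a minimal sketch:
! the function name, the file name mismatch.c, and the input file name
! are all hypothetical, and the build commands in the comment assume
! that mismatch.c also contains a main routine that reads the input.

```c
/* Possible build sequence with GCC:
     gcc -O2 -fprofile-generate mismatch.c -o mismatch   (1) instrument
     ./mismatch typical-input                            (2) training run
     gcc -O2 -fprofile-use mismatch.c -o mismatch        (3) re-compile
*/

/* Count positions at which two arrays differ.  Without a training
   run the compiler cannot know whether the "if" is usually true. */
int count_mismatches(const int *a, const int *b, int n)
{
  int i, count = 0;
  for (i = 0; i < n; i++)
    {
      if (a[i] != b[i])
        count++;     /* how often this runs depends on the data */
    }
  return count;
}
```

! If the training data rarely has mismatches, the compiler can arrange
! the loop so that the common (equal) path has no taken branch for the
! "if".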

 !! GCC Profiling Compiler Switches

! `-fprofile-generate'
! `-fprofile-generate=PATH'
!      Enable options usually used for instrumenting application to
!      produce profile useful for later recompilation with profile
!      feedback based optimization.  You must use `-fprofile-generate'
!      both when compiling and when linking your program.
! 
!      The following options are enabled: `-fprofile-arcs',
!      `-fprofile-values', `-fvpt'.
! 
!      If PATH is specified, GCC will look at the PATH to find the
!      profile feedback data files. See `-fprofile-dir'.
! 
! `-fprofile-use'
! `-fprofile-use=PATH'
!      Enable profile feedback directed optimizations, and optimizations
!      generally profitable only with profile feedback available.
! 
!      The following options are enabled: `-fbranch-probabilities',
!      `-fvpt', `-funroll-loops', `-fpeel-loops', `-ftracer'
! 
!      By default, GCC emits an error message if the feedback profiles do
!      not match the source code.  This error can be turned into a
!      warning by using `-Wcoverage-mismatch'.  Note this may result in
!      poorly optimized code.
! 
!      If PATH is specified, GCC will look at the PATH to find the
!      profile feedback data files. See `-fprofile-dir'.
!
! `-fbranch-probabilities'
!      After running a program compiled with `-fprofile-arcs' (*note
!      Options for Debugging Your Program or `gcc': Debugging Options.),
!      you can compile it a second time using `-fbranch-probabilities',
!      to improve optimizations based on the number of times each branch
!      was taken.  When the program compiled with `-fprofile-arcs' exits
!      it saves arc execution counts to a file called `SOURCENAME.gcda'
!      for each source file.  The information in this data file is very
!      dependent on the structure of the generated code, so you must use
!      the same source code and the same optimization options for both
!      compilations.
! 
!      With `-fbranch-probabilities', GCC puts a `REG_BR_PROB' note on
!      each `JUMP_INSN' and `CALL_INSN'.  These can be used to improve
!      optimization.  Currently, they are only used in one place: in
!      `reorg.c', instead of guessing which path a branch is most likely to
!      take, the `REG_BR_PROB' values are used to exactly determine which
!      path is taken more often.
! 
! `-fprofile-values'
!      If combined with `-fprofile-arcs', it adds code so that some data
!      about values of expressions in the program is gathered.
! 
!      With `-fbranch-probabilities', it reads back the data gathered
!      from profiling values of expressions and adds `REG_VALUE_PROFILE'
!      notes to instructions for their later usage in optimizations.
! 
!      Enabled with `-fprofile-generate' and `-fprofile-use'.
! 
! `-fvpt'
!      If combined with `-fprofile-arcs', it instructs the compiler to add
!      a code to gather information about values of expressions.
! 
!      With `-fbranch-probabilities', it reads back the data gathered and
!      actually performs the optimizations based on them.  Currently the
!      optimizations include specialization of division operation using
!      the knowledge about the value of the denominator.







!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Use of Compiler Switches

! ISA
!   Type of system all users have. (Intel 64 for PCs, SPARC for Sun users, etc.)
!   Users can't normally run code compiled for a different ISA.

! Implementation
!   Type of system most users have.
!   Other users can run the code, but it won't run as fast.

! Optimization
!   Select medium or high optimization level.
!   If market very sensitive to performance, use specific optimizations.