!! LSU EE 4720 -- Spring 2025 -- Computer Architecture
!
!! Compiler Optimization Lecture Notes

!! Integrate this ref: CACM Feb 2020 p 41. Opt in C++

!! Contents
!
! Optimization Introduction
! Steps in Building a Program
! Compiler Terminology
! High-Level Optimizations
! Low-Level Optimizations
! Compiler Optimization Options
! Use of Compiler Switches

!! References
!
! :HP3: Hennessy & Patterson, "Computer Architecture, a Quantitative Approach"

!! Lecture Goals
!
! Understand the Program Building and Compilation Process
!  Describe steps in program building and optimization, including
!  intermediate files (assembler, object, ...) and tool names (preprocessor,
!  compiler, etc.).
!
! Understand Specific Optimizations and Assumption Switches
!  Describe, work example, explain benefit.
!
! Understand Profiling
!  Steps, how performance is improved.
!
! Understand ISA and Implementation Options
!  How programmer chooses them, how compiler uses them.
!
! Understand how Programmers Use Compilation Options
!  Normal program development, high-performance programs, SPEC disclosure.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Optimization Introduction

! :HP3: Section 2.11

! :Def: Optimization
! The optional steps in compiling a program that reduce the program's
! execution time, reduce the size of the program, energy consumed, etc.

! Typically, the only time optimization is NOT done is when a program
! is being debugged.
!
! In most cases, the programmer sets overall optimization effort (low,
! medium, high).
!
! When performance is very important the programmer can specify which
! specific optimizations to use.

! Commands (Linux)
! Laptop: taskset 0xff perf stat ./pi
! taskset 0xff perf stat --log-fd 1 -e instructions ./pi | grep instr

! :Example:
!
! Data on a program that computes π with and without optimization.
!
! OS: Linux 6.8.5-301.fc40.x86_64
! Hardware: 13th Gen Intel(R) Core(TM) i9-13950HX
! Compiler: gcc version 14.0.1 20240411 (Red Hat 14.0.1-0) (GCC)
!
!
! Without Optimization
!  Clock             : 5.171 GHz
!  Instruction Count : 575,160,252
!  Run Time          : 79.21 ms
!  Insn Throughput   : 7.261 G insn/s, 1.404 insn/cyc
!
! With Optimization
!  Clock             : 5.309 GHz
!  Instruction Count : 250,160,137
!  Run Time          : Let's predict.
!  Predict Run Time:
!   (* 79.21 (/ 250160137 575160252.0) (/ 5.171 5.309)); = 33.5561
!  Measured:
!   Run Time         : 37.18 ms
!   (/ 250160137.0 (* 5.309 37.18 .001 1000000000.0)); = 1.2673 insn/cyc

! Commands:
!  gcc pi.c -o pi-noopt -O0
!  gcc pi.c -o pi -O3
!  taskset 0xff perf stat ./pi
!  taskset 0xff perf stat ./pi-noopt

! Programs:
!  gcc: GNU C Compiler (more correct: GNU Compiler Collection)
!  taskset: Utility to run code on certain cores. Fast ones in this case.
!  perf: Utility to collect performance data.

! :Example:
!
! Data on a program that computes π with and without optimization.
!
! System: SunOS sol 5.6 Generic_105181-31 sun4u sparc SUNW,Ultra-Enterprise
! Compiler: Sun WorkShop 6 2000/04/07 C 5.1
! Clock Frequency: 300 MHz
!
! Without Optimization
!  Size : 6408 bytes
!  Instruction Count : 325,338,749
!  Run Time : 3.00 s
!  CPI : (/ (* 3.00 300e6) 325338749.0 ) = 2.7663
!  IPC : (/ 325338749.0 (* 3.00 300e6) ) = 0.3615
!
! With Optimization
!  Size : 6340 bytes                  Smaller! (OK, only a tiny bit smaller.)
!  Instruction Count : 100,338,751    Just 1/3 the instructions!!
!  Run Time: Let's predict this.
!  Run Time : 1.65 s                  About half the run time. That's it?
!  CPI : (/ (* 1.65 300e6) 100338751.0) = 4.9333
!  IPC : (/ 100338751.0 (* 1.65 300e6)) = 0.2027
!
! Comparison of Un-optimized and Optimized Runs
!
!  Un-optimized run takes 1.82 times as long.
!  Un-optimized run executes 3.24 times as many instructions.
!
!  --> Execution time is not proportional to instruction count
!      We've already seen how this can be true due to stalls.
!
!  --> In the slower version ..
!      .. instructions seem to be executed more efficiently [sic].
!
! Quick Explanation (See insn scheduling discussion below for code.)
!
!  The un-optimized version contained more easy instructions, such
!  as load and store instructions (that would hit the cache).
!
!  Both versions had the same number of floating-point divide instructions
!  which take a long time to execute.

!! Reasons *not* to Optimize
!
! Percentage shows roughly how often reason applies.
!
! 95% It makes debugging difficult, so don't optimize while debugging.
!     This is true for almost everyone who uses a debugger.
!
! 10% It slows down compilation.
!     Only important when there is a very large amount of code to
!     recompile. (Back in the 20th century when computers were slow
!     this was important.)
!
! .001% Optimization introduces bugs.
!     It does, but very rarely.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Steps in Building a Program

! Typical Steps in Building a Program
!
!  Pre-Process, Compile, Assemble, Link, Load
!  pi.c -> pi.i -> pi.s -> pi.o -> pi
!
! These steps can all be performed automatically by the compiler
! driver. (cc, gcc, MS Visual Studio, etc.)
!
! :Sample: cc pi.c -o pi
!
! They can also be specified individually.
!
! More details appear below, compile is what we're interested in.

!! Pre-Process
!  Load include files, expand macros.
!  Typical pre-processor name: cpp
!  Input File: pi.c
!  Output File: pi.i (High-Level Language)

!! Compile
!  Convert high-level language into assembler.
!  Typical compiler name: cc1 (not cc, or gcc, or Visual Studio)
!  Input File: pi.i
!  Output File: pi.s (Assembler)

!! Assemble
!  Convert assembly language into machine language.
!  Typical assembler name: as
!  Input File: pi.s
!  Output File: pi.o (Object file containing machine code.)

!! Link
!  Combine object files, libraries, and other code into executable.
!  Typical linker name: ld
!  Input File: pi.o
!  Output File: pi

!! Load
!  Copy executable into memory, link with shared libraries, and start.
!  Loader name: exec system call.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Terminology

! A program is compiled using steps, or /passes/.
!
! The first pass reads the pre-processed high-level source code.
!
! The last pass emits assembler code.
!
! Between the first and last passes the program is in some
! intermediate representation.
!
! The way passes are defined and organized varies greatly from
! compiler to compiler.

! :Def: Pass
! A step in compiling a program. A pass usually looks at an entire function.

! :Def: Intermediate Representation
! The form of a program (sort of a special-purpose language) used internally
! by a compiler. A compiler might use several intermediate representations.

!! Typical Passes
!
! Parse
!  Convert the source code to a high-level intermediate representation (H-IR).
!
! High-Level Optimization
!  Optional, may be done in several passes.
!  Also called front-end optimization.
!  Modify H-IR to improve performance or reduce code size.
!  Reads and writes H-IR.
!
! Low-Level Intermediate Representation (L-IR) Generation
!  Convert H-IR to a low-level intermediate representation (L-IR).
!
! Low-Level Optimization (Optional, may be done in several passes.)
!  Also called back-end optimization.
!  Modify L-IR to improve performance or reduce code size.
!
! Register Assignment (Part of low-level optimization.)
!  Choose machine registers.
!
! Code Generation
!  Convert L-IR to assembly code.
!
! Peephole Optimizations (Optional, may be done in several passes.)
!  These are also called low-level optimizations.
!  Modify L-IR to improve performance or reduce code size.
!  Some of these can be done at link time.

! :Def: Compiler Front End
! The parts of the compiler that do the parsing and high-level
! optimization passes. Computer architects are less interested
! in this part.

! :Def: Compiler Back End
! The parts of the compiler that do the low-level optimization,
! register assignment, and code generation passes. Computer
! architects are very interested in this part.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! High-Level Optimizations

! Easy high-level optimizations are presented here.

!! Some Easy-To-Explain Front-End Optimizations
!
! Dead-Code Elimination (DCE)
! Common Subexpression Elimination (CSE)
! Constant Propagation, Folding

! :Def: Dead-Code Elimination (DCE)
! Removal of code which isn't used.
! Yes, it happens.
! This can also be a low-level optimization.

! :Example:
! Before
  x = a + b;
  x = c + d;
! After
  x = c + d;

! :Example:
!
! Code benefiting from DCE
! High-level code shown for clarity. Most compilers will transform
! an intermediate representation.
!
! Before:
!
int
main(int argc, char **argv)
{
  double i;
  double sum = 0;

  for(i=1; i<50000000;)
    {
      sum = sum + 4.0 / i; i += 2;
      sum = sum - 4.0 / i; i += 2;
    }

  printf("Performed %.0f iterations. Thank you for running me.\n",i);
}
!
! After:
!
int
main(int argc, char **argv)
{
  double i;

  for(i=1; i<50000000;)
    {
      i += 2;
      i += 2;
    }

  printf("Performed %.0f iterations. Thank you for running me.\n",i);
}
!
! Note: Other optimizations would leave only the printf.

! :Def: Common Subexpression Elimination (CSE)
! Remove duplicated code.
!
! Before:
  r = ( a + b ) / ( x + y );
  s = ( a + b ) / ( x - y );
! After:
  temp = a + b;
  r = ( temp ) / ( x + y );
  s = ( temp ) / ( x - y );

! :Def: Constant Propagation, Folding
! The compiler performs whatever arithmetic it can at compile time
! rather than emitting code to perform the arithmetic at run time.
!
! Before:
  int sec_per_day = 60 * 60 * 24;
  int sec_per_year = sec_per_day * 365;
  some_routine(sec_per_day * x, sec_per_year * y);
! After:
  int sec_per_day = 86400;
  int sec_per_year = 31536000;
  some_routine(86400 * x, 31536000 * y);

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Low-Level Optimizations

! Some Low-Level Optimizations
!
! This is not a complete list.
!
! Register Assignment
! Instruction Selection
! Scheduling

! :Def: Register Assignment
! Selection of which values will be held in registers. Values
! not held in registers are stored in memory.
!
! Without Register Assignment Optimizations
!  All values corresponding to variables (in high-level program) are
!  written to memory (and not held in registers). Intermediate results are
!  held in registers.
!
! With Register Assignment Optimization
!  Registers are assigned to as many variables as possible, with priority
!  given to frequently used variables.
!
! Advantage of Register Assignment Optimization
!  Fewer memory writes and reads.

!! Instruction Selection

! :Def: Scheduling
! Re-arranging instructions to minimize the amount of time one instruction
! has to wait for another.
!
! For example, if an instruction takes a long time it will be started early
! so that other instructions will not have to wait for its result.
!
! Scheduling will be covered in more detail later in the semester.

! :Example:
!
! pi program without and with optimization.
!
! Optimizations include register assignment and scheduling.
!
! Without Optimization
!
! 10 {
! 11 sum = sum + 4.0 / i; i += 2;

        ldd     [%fp-24],%f6    ! f6 = sum
        ldd     [%l0+0],%f4     ! f4 = 4.0
        ldd     [%fp-16],%f2    ! f2 = i
        fdivd   %f4,%f2,%f2     ! f2 = 4.0 / i
        faddd   %f6,%f2,%f2     ! f2 = sum + (4.0/i)
        std     %f2,[%fp-24]    ! sum = f2
        ldd     [%fp-16],%f4    ! f4 = i
        ldd     [%l0+8],%f2     ! f2 = 2.0
        faddd   %f4,%f2,%f2     ! f2 = i + 2.0
        std     %f2,[%fp-16]    ! i = f2

! 12 sum = sum - 4.0 / i; i += 2;

        ldd     [%fp-24],%f6    ! f6 = sum
        ldd     [%l0+0],%f4     ! f4 = 4.0
        ldd     [%fp-16],%f2    ! f2 = i
        fdivd   %f4,%f2,%f2     ! f2 = 4.0 / i
        fsubd   %f6,%f2,%f2     ! f2 = sum - (4.0/i)
        std     %f2,[%fp-24]    ! sum = f2
        ldd     [%fp-16],%f4    ! f4 = i
        ldd     [%l0+8],%f2     ! f2 = 2.0
        faddd   %f4,%f2,%f2     ! f2 = i + 2.0
        std     %f2,[%fp-16]    ! i = f2
        ldd     [%fp-16],%f4    ! f4 = i
        ldd     [%l0-8],%f2     ! f2 = 50000000
        fcmped  %f4,%f2         ! compare i, 50000000
        nop
        fbl     .L92            ! Branch if FP comparison less than.
        nop
!
! With Optimization
!
!   10                !  {
!   11                !      sum = sum + 4.0 / i; i += 2;
!
                      .L77000016:
/* 0x0020     11 */     fdivd   %f4,%f30,%f6    ! temp1 = 4.0 / i_2
/* 0x0024        */     faddd   %f30,%f2,%f8    ! i_1 = i_2 + 2.0
!   12                !      sum = sum - 4.0 / i; i += 2;
/* 0x0028     12 */     faddd   %f8,%f2,%f30    ! i_2 = i_1 + 2.0
/* 0x002c        */     fcmped  %f30,%f0        ! i_2 < 50000000
/* 0x0030        */     fdivd   %f4,%f8,%f8     ! temp2 = 4.0 / i_1
/* 0x0034     11 */     faddd   %f10,%f6,%f6    ! sum = sum + temp1
/* 0x0038     12 */     fbl     .L77000016
/* 0x003c        */     fsubd   %f6,%f8,%f10    ! sum = sum - temp2
!   13                !  }

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Compiler Optimization Options

! Compiler options tell the compiler:
!
!  How much EFFORT to put into optimization. E.g., -O2
!  Which PARTICULAR optimizations to perform. E.g., -fstrength-reduce
!  The TARGET system the code will be running on. E.g., -xultra2
!  Whether to make certain ASSUMPTIONS about the code.
!  Whether to use PROFILING data from a training run.

!! Target Options
!
! Specify the exact type of machine the code will run on: ISA, Implementation, Cache
!
! Choice is based on type of machines customers have.
!
!
!! Specifying the: ISA
!
! The exact instruction set used.
! Specifies not just the ISA family, but a particular variation.
! A poor choice will limit the number of machines code can run on.
!
!! Specifying the: Implementation (Processor Core)
!
! Specify the implementation code will run on.
! A poor choice will result in slower code.
!
!! Background
!
! ARM AArch64 Architecture
!
!  Developed by ARM.
!
! Some Implementations
!
!  cortex-a53
!  cortex-a72
!  exynos-m1

! :Example:
!
! Switches for GCC
!
! -march Specifies ISA
! -mtune Specifies the implementation.
!
! -mcpu Specifies both ISA and implementation
!     Specify the name of the target processor for which GCC should tune
!     the performance of the code. Permissible values for this option
!     are: 'generic', 'cortex-a53', 'cortex-a57', 'cortex-a72',
!     'exynos-m1', 'thunderx', 'xgene1'.

!! Optimization EFFORT
!
! :Def: Optimization Effort
! Amount of optimization to be done. A small effort means performing
! only easy optimizations, a large effort means performing more
! time-consuming optimizations.
!
! Most compilers have optimization levels.
!
! The higher the number, the more optimizations done.

! :Example:
!
! Optimization Levels for gcc Version 9:

! '-O'
! '-O1'
!      Optimize. Optimizing compilation takes somewhat more time, and a
!      lot more memory for a large function.
!
!      With '-O', the compiler tries to reduce code size and execution
!      time, without performing any optimizations that take a great deal
!      of compilation time.
!
! '-O2'
!      Optimize even more. GCC performs nearly all supported
!      optimizations that do not involve a space-speed tradeoff. As
!      compared to '-O', this option increases both compilation time and
!      the performance of the generated code.
!
!
! '-O3'
!      Optimize yet more. '-O3' turns on all optimizations specified by
!      '-O2' and also turns on the
!      [snip]
!
! '-O0'
!      Reduce compilation time and make debugging produce the expected
!      results. This is the default.
!
! '-Os'
!      Optimize for size. '-Os' enables all '-O2' optimizations that do
!      not typically increase code size. It also performs further
!      optimizations designed to reduce code size.
!
! '-Ofast'
!      Disregard strict standards compliance. '-Ofast' enables all '-O3'
!      optimizations. It also enables optimizations that are not valid
!      for all standard-compliant programs. It turns on '-ffast-math' and
!      the Fortran-specific '-fno-protect-parens' and '-fstack-arrays'.
!
! '-Og'
!      Optimize debugging experience. '-Og' enables optimizations that do
!      not interfere with debugging. It should be the optimization level
!      of choice for the standard edit-compile-debug cycle, offering a
!      reasonable level of optimization while maintaining fast compilation
!      and a good debugging experience.

!! PARTICULAR Optimizations
!
! The levels specify sets of optimizations (like ordering the "Sport Package"
! for a new car).
!
! In contrast to optimization levels (-O3), the compiler can be told which
! particular optimizations to make.
!
! These are typically used by skilled programmers trying to get
! the fastest code.
!
! Some examples below.

! gcc
!
! '-fschedule-insns'
!      If supported for the target machine, attempt to reorder
!      instructions to eliminate execution stalls due to required data
!      being unavailable. This helps machines that have slow floating
!      point or memory load instructions by allowing other instructions to
!      be issued until the result of the load or floating-point
!      instruction is required.
!
!
! `-frerun-loop-opt'
!      Run the loop optimizer twice.
!
! '-fprefetch-loop-arrays'
!      If supported by the target machine, generate instructions to
!      prefetch memory to improve the performance of loops that access
!      large arrays.
!
!      This option may generate better or worse code; results are highly
!      dependent on the structure of loops within the source code.

!! ASSUMPTIONS (Assertions) About the Program

! Compilers must generate correct code.
!
!  That is, the code must execute in the way specified by the
!  high-level language definition.
!
! Correct code can be slow.
!
!  The compiled code might need to check for things that can happen ...
!  ... but don't in a particular program.
!
! Some options tell the compiler to make assumptions about the program.
!
!  These assumptions would not hold for every program.
!
!  The compiled program runs faster ...
!  ... and correctly if the assumptions are valid.

! Some switches specifying assumptions:

! gcc, Assume the program does not require strict IEEE 754 FP features.

! `-ffast-math'
!      This option allows GCC to violate some ANSI or IEEE rules and/or
!      specifications in the interest of optimizing code for speed. For
!      example, it allows the compiler to assume arguments to the `sqrt'
!      function are non-negative numbers and that no floating-point values
!      are NaNs.
! cc (Sun Workshop 6 Compiler) Assume certain pointers do not overlap.
!
! -xrestrict=f
!      (SPARC) Treats pointer-valued function parameters as
!      restricted pointers. f is a comma-separated list that
!      consists of one or more function parameters, %all,
!      %none. This command-line option can be used on its
!      own, but is best used with optimization of -xO3 or
!      greater.
!
!      The default is %none. Specifying -xrestrict is
!      equivalent to specifying -xrestrict=%all.
!
! :Example:
!
! In the loop below the compiler would ordinarily load the value at x
! from memory (dereference) each iteration (five times) because the
! address of x may be the same as the address of one of the "a"
! elements. With -xrestrict switch, x is not loaded each iteration,
! saving time and space. The switch is needed because the compiler
! has no way of knowing if x and the a's overlap and must otherwise
! make a conservative assumption (that they overlap).

void
array_add(int *a, int *b, int *x)
{
! Suppose: &a[0] = 0x1000, &x[0] = 0x2000 // Normal.
!          &a[1] = 0x1004,
!          &a[2] = 0x1008, ..

! Suppose: &a[0] = 0x1000, &x[0] = 0x1004 // Huh?!?
!          &a[1] = 0x1004,
!          &a[2] = 0x1008, ..

  int i;
  for(i=0; i<5; i++)
    a[i] = b[i] + x[0];

! Plan B
  int xzero = x[0];
  for(i=0; i<5; i++)
    a[i] = b[i] + xzero;
}

! Compiled Code Without -xrestrict

!   24                !  int i;
!   25                !  for(i=0; i<5; i++)
!   26                !    a[i] = b[i] + x[0];

! %g4 i
! %o2 x
! %g3 &a[i]
! %g2 &b[i]

/* 0x0004     26 */     ld      [%o1],%o5       ! o5 = Mem[o1]
/* 0x0008     23 */     or      %g0,%o1,%g2     ! g2 = 0 + o1
/* 0x000c     25 */     or      %g0,0,%g4       ! g4 = 0 + 0
                      .L900000108:              ! The loop body starts here.
/* 0x0010     26 */     ld      [%o2],%o4       ! <-- Load x, notice that
/* 0x0014        */     add     %g4,1,%g4       !     o2 does not change.
/* 0x0018        */     add     %g2,4,%g2       ! g2 = g2 + 4
/* 0x001c        */     cmp     %g4,5
/* 0x0020        */     add     %o5,%o4,%o5     ! o5 = b[i] + x[0]
/* 0x0024        */     st      %o5,[%g3]       ! a[i] = o5
/* 0x0028        */     add     %g3,4,%g3
/* 0x002c        */     bl,a    .L900000108
/* 0x0030        */     ld      [%g2],%o5       ! o5 = b[i+1]
                      .L77000022:
/* 0x0034        */     retl
! Result =
/* 0x0038        */     nop

! Compiled Code With -xrestrict

/* 000000     23 */     or      %g0,%o0,%g4

!   24                !  int i;
!   25                !  for(i=0; i<5; i++)
!   26                !    a[i] = b[i] + *x;

! o2 x
! g2 *x
! g4 &a[i]
! g3 &b[i]

/* 0x0004     26 */     ld      [%o2],%g2       ! <-- Load x before entering
/* 0x0008     23 */     or      %g0,%o1,%g3     !     the loop.
/* 0x000c     26 */     ld      [%o1],%o5
/* 0x0010     25 */     or      %g0,0,%g1
                      .L900000108:              ! The loop body starts here.
/* 0x0014     26 */     add     %o5,%g2,%o5
/* 0x0018        */     st      %o5,[%g4]
/* 0x001c        */     add     %g1,1,%g1
/* 0x0020        */     add     %g3,4,%g3
/* 0x0024        */     add     %g4,4,%g4
/* 0x0028        */     cmp     %g1,5
/* 0x002c        */     bl,a    .L900000108
/* 0x0030        */     ld      [%g3],%o5
                      .L77000022:
/* 0x0034        */     retl                    ! Result =
/* 0x0038        */     nop

! If array_add is compiled using the -xrestrict switch and it is
! called by the code below, it will produce the wrong result (since
! the value at x changes while the loop runs).

int
main(int argc, char **argv)
{
  int a[5], b[5];
  int *x = &a[1];
  ... (More code here)
  array_add(a,b,x);
}

! (End of example.)

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Profiling
!!

! :Def: Profiling (a.k.a. Feedback-Directed Optimization (FDO))
! The techniques used to optimize a program using statistics collected
! from a training run of the program.

!! Typical Profiling Procedure
!
! (1) Compile program with profiling option.
!
! (2) Run program using typical input data.
!     This run is called the training run and the input data is
!     called the training data.
!     During the training run statistics are collected.
!
! (3) Run compiler again, specifying where to find the statistics.

! :Example:
!
! Optimization of a simple if/else statement:
!
!  if ( r1 != r2 )               // Mostly r1 equals r2.
!   { IFPART: r3 = r4 + r5; }
!  else
!   { ELSEPART: r9 = r10 - r11; } // More likely path
!
!  CONTINUE: r6 = r7 ^ r8;
!
! On the target there is a large branch penalty ..
! .. and so the optimizer should try to ..
! .. minimize the number of *taken* branches.

!!
!! Unoptimized Code
!!
        bne r1, r2 IFPART
        nop
ELSEPART:
        j CONTINUE;
        sub r9, r10, r11
IFPART:
        add r3, r4, r5
CONTINUE:
        xor r6, r7, r8

! When r1==r2: 4 insn, 1 taken control transfer. More likely
! When r1!=r2: 3 insn, 1 taken control transfer.
!
! In the unoptimized code above ..
! .. there is one taken control transfer (bne or j) ..
! .. whether or not r1 == r2.

!!
!! Optimized Using Profile Data
!!
! Compiler now "knows" that r1==r2 is the common case.

        bne r1, r2 IFPART
        nop
ELSEPART:
        sub r9, r10, r11
CONTINUE:
        xor r6, r7, r8

        !! Lots of additional code.
        !! Somewhere far away.

IFPART:
        add r3, r4, r5
        j CONTINUE;
        nop
!
! In the profile-optimized code above ..
! .. for the common case, r1 == r2, there are zero taken control transfers ..
! .. and for the uncommon case, there are two (bne and j).

! Branches occur frequently in code. There is a performance penalty in
! taking a branch and so it's best if the compiler organizes code
! (rearranges the intermediate representation) so that branches are
! taken as rarely as possible. To do that the compiler needs to know
! how often an "if" or other condition (for which the branch is
! emitted) is true. Only in a few cases can the compiler figure that
! out on its own, because, for example, "if" conditions often depend on
! input data. To obtain this useful information a two-step compilation
! process called profiling is used. In the first step the code is
! compiled so that it writes branch information (more precisely basic
! block, covered later) to a file. The program is run with typical
! input data, called the training input, and it writes the branch
! information to a file. In the second step the compiler reads the
! information and uses that to better organize the code.

!! GCC Profiling Compiler Switches

! `-fprofile-generate'
! `-fprofile-generate=PATH'
!      Enable options usually used for instrumenting application to
!      produce profile useful for later recompilation with profile
!      feedback based optimization. You must use `-fprofile-generate'
!      both when compiling and when linking your program.
!
!      The following options are enabled: `-fprofile-arcs',
!      `-fprofile-values', `-fvpt'.
!
!      If PATH is specified, GCC will look at the PATH to find the
!      profile feedback data files. See `-fprofile-dir'.
!
! `-fprofile-use'
! `-fprofile-use=PATH'
!      Enable profile feedback directed optimizations, and optimizations
!      generally profitable only with profile feedback available.
!
!      The following options are enabled: `-fbranch-probabilities',
!      `-fvpt', `-funroll-loops', `-fpeel-loops', `-ftracer'
!
!      By default, GCC emits an error message if the feedback profiles do
!      not match the source code. This error can be turned into a
!      warning by using `-Wcoverage-mismatch'. Note this may result in
!      poorly optimized code.
!
!      If PATH is specified, GCC will look at the PATH to find the
!      profile feedback data files. See `-fprofile-dir'.
!
! `-fbranch-probabilities'
!      After running a program compiled with `-fprofile-arcs' (*note
!      Options for Debugging Your Program or `gcc': Debugging Options.),
!      you can compile it a second time using `-fbranch-probabilities',
!      to improve optimizations based on the number of times each branch
!      was taken. When the program compiled with `-fprofile-arcs' exits
!      it saves arc execution counts to a file called `SOURCENAME.gcda'
!      for each source file. The information in this data file is very
!      dependent on the structure of the generated code, so you must use
!      the same source code and the same optimization options for both
!      compilations.
!
!      With `-fbranch-probabilities', GCC puts a `REG_BR_PROB' note on
!      each `JUMP_INSN' and `CALL_INSN'. These can be used to improve
!      optimization. Currently, they are only used in one place: in
!      `reorg.c', instead of guessing which path a branch is mostly to
!      take, the `REG_BR_PROB' values are used to exactly determine which
!      path is taken more often.
!
! `-fprofile-values'
!      If combined with `-fprofile-arcs', it adds code so that some data
!      about values of expressions in the program is gathered.
!
!      With `-fbranch-probabilities', it reads back the data gathered
!      from profiling values of expressions and adds `REG_VALUE_PROFILE'
!      notes to instructions for their later usage in optimizations.
!
!      Enabled with `-fprofile-generate' and `-fprofile-use'.
!
! `-fvpt'
!      If combined with `-fprofile-arcs', it instructs the compiler to add
!      a code to gather information about values of expressions.
!
!      With `-fbranch-probabilities', it reads back the data gathered and
!      actually performs the optimizations based on them. Currently the
!      optimizations include specialization of division operation using
!      the knowledge about the value of the denominator.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! Use of Compiler Switches

! ISA
!  Type of system all users have. (Intel 64 for PCs, SPARC for Sun users, etc.)
!  Users can't normally run code compiled for a different ISA.

! Implementation
!  Type of system most users have.
!  Other users can run the code, but it won't run as fast.

! Optimization
!  Select medium or high optimization level.
!  If market very sensitive to performance, use specific optimizations.