LSU EE 7700-2 -- Fall 2003 -- Multiprocessors, Etc.

Note Set 7 -- Critical Path Compression

Under Construction, To Do
 Finish UC list.

Contents

References
  Listed in each section.
  Linked to http://www.ece.lsu.edu/tca/ref.html

--------------------------------------------------------------------------------

Critical Path (Dataflow?) Compression Techniques

 Goal
  Reduce number of instructions on critical path.

 Techniques
  Predict
  Optimize
  Remember. (Memoize, a special case of optimization.)

 Running Example
  Digit identifies instruction, critical path shown.
  1-2-3-4-5-6-7

 Value Prediction
  Predict source operands.
  Example showing correct prediction.
   1-2-3\     (Sources for 4 predicted)
   4-5-6-7
  Example showing incorrect prediction.
   1-2-3-4-5-6-7
   4-5-6x  (Squash)
  Promising Because
   Many values predictable.
  Impractical (So Far) Because
   Predictor table sizes much too large.
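
The predict/verify behavior above can be sketched as follows (function names and the sequential framing are illustrative assumptions, not any machine's actual mechanism): the consumer runs early on a predicted operand, and a mismatch with the producer's real result squashes the speculative work.

```python
# Hedged sketch of operand value prediction with verification.
# Names are illustrative; timing is collapsed into sequential calls.

def execute_with_prediction(produce_real, consume, predicted):
    speculative = consume(predicted)   # e.g. 4 starts before 3 finishes
    real = produce_real()              # producer's actual result arrives
    if real == predicted:
        return speculative             # correct: critical path shortened
    return consume(real)               # mispredict: squash, re-execute

# Correct prediction keeps the early result; incorrect redoes the work.
ok = execute_with_prediction(lambda: 7, lambda v: v + 1, predicted=7)
redo = execute_with_prediction(lambda: 7, lambda v: v + 1, predicted=5)
```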

 Computation Re-Use
  Can re-use an individual instruction or a slice (several insns).
  Insn re-use: for a PC, cache source operands ...
   and results ...
   for re-use.
   Example, 3 is a divide instruction:
    W/O Reuse:  1-2-3-------4-5
    With Reuse: 1-2-r-4-5   (r is time for the table lookup.)
  Slice re-use: for a PC, cache live-ins ...
   and live-outs...
   for re-use.
   1-2-3-7   (4-5-6 cached).
  Promising Because
   Over 90% of computations re-used (check).
  Impractical Because
   Table lookup starts too late to be useful.
   Table sizes much too large.
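
Insn re-use amounts to a memo table indexed by PC and tagged with source operand values. A minimal sketch, assuming a single-entry-per-PC table (the dict-based table and names are my simplifications):

```python
# Hedged sketch of an instruction re-use buffer: indexed by PC,
# tagged with operand values, holding the cached result.

reuse_buffer = {}   # pc -> (operands, result)

def execute_or_reuse(pc, operands, compute):
    """Skip a long-latency op (e.g. divide) when operands match a prior use."""
    entry = reuse_buffer.get(pc)
    if entry is not None and entry[0] == operands:
        return entry[1]                    # hit: pay only lookup time "r"
    result = compute(*operands)            # miss: pay full latency
    reuse_buffer[pc] = (operands, result)
    return result

# First divide at PC 0x40 pays full latency; the repeat is a table hit.
q1 = execute_or_reuse(0x40, (84, 2), lambda a, b: a // b)
q2 = execute_or_reuse(0x40, (84, 2), lambda a, b: a // b)
```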

 Optimization Cache, Dynamic Recompilation
  Prepare optimized version of code at run-time.
  Store optimized code.
   1-a-b-c-7  (a-b-c equivalent to 2-3-4-5-6)
  Promising Because
   With branch outcomes and stable reg values ...
    can do many optimizations.
  Not Yet Implemented Because
   Complexity.

 Store / Load Renaming
  Predict that a store/load pair uses the same address, so the load can take the stored value directly.

 Master/Slave Speculative Parallelization
  One processor runs fast & dangerous code ...
   the other processors run slow and safe code ...
   using and verifying values from the fast & dangerous one.
   a-b-c  (a provides 4's input, b provides 6's input)
   1-2-3  (3 verifies 4's input, 5 verifies 6's input)
     4-5
       6-7
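
A toy sketch of the master/slave idea (all function names, the example computation, and the squash policy are illustrative assumptions): the master races ahead producing predicted live-ins for later segments, while the safe segments verify each prediction as they catch up.

```python
# Hedged sketch of master/slave speculative parallelization.

def master_distilled(x):
    # Fast & dangerous: skips the rare-case check, may be wrong.
    return x * 2              # predicted live-in for the next segment

def segment_exact(x):
    # Slow & safe: the real computation, rare case included.
    return x * 2 if x >= 0 else 0

def run_speculative(x0, n_segments):
    # Master races ahead, predicting each segment's live-in.
    predicted = [x0]
    for _ in range(n_segments):
        predicted.append(master_distilled(predicted[-1]))
    # Slaves (conceptually parallel) compute exact results and verify
    # the live-in the master handed to the next segment.
    live_in = x0
    for i in range(n_segments):
        exact = segment_exact(live_in)
        if exact != predicted[i + 1]:
            # Misspeculation: later segments must restart from here.
            return ("squash at segment", i, exact)
        live_in = exact
    return ("ok", live_in)
```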
  
 Techniques Covered Elsewhere or Omitted.
  Prefetching.
  Load address prediction.
  Branch prediction.



--------------------------------------------------------------------------------
Value Prediction 

 Gabbay & Mendelson ToCS 1998

 Techniques
  Last value.
  Stride. 
  Correlated (FCM).  Like local history.  (Not covered in G & M.)
  Hybrid.  (Not covered in G & M.)
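
The last-value and stride schemes can be sketched as follows (one table entry per static instruction; the dict indexing and unlimited table size are simplifying assumptions, in the spirit of the infinite-table study below):

```python
# Hedged sketches of last-value and stride value predictors.

class LastValuePredictor:
    def __init__(self):
        self.last = {}                 # pc -> last produced value
    def predict(self, pc):
        return self.last.get(pc, 0)
    def update(self, pc, value):
        self.last[pc] = value

class StridePredictor:
    def __init__(self):
        self.table = {}                # pc -> (last value, stride)
    def predict(self, pc):
        last, stride = self.table.get(pc, (0, 0))
        return last + stride
    def update(self, pc, value):
        last, _ = self.table.get(pc, (value, 0))
        self.table[pc] = (value, value - last)

# A loop induction variable (0, 4, 8, ...) defeats last-value prediction
# but is captured exactly by the stride predictor after warm-up.
sp = StridePredictor()
for v in (0, 4, 8):
    sp.update(0x10, v)
pred = sp.predict(0x10)    # expect 12
```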

 Performance of Last Value, Stride
  SPEC Int95, SPARC, Infinite tables.
  Prediction Accuracy of Load Values:  Stride, 48%; Last-Value, 50%.
  Prediction Accuracy of ALU Values: Stride, 61%; Last-Value, 52%.
  Performance (speedup, etc) not analyzed.

 Value Classification (Prediction Confidence)
  Use saturating counter to assign confidence.
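
A minimal sketch of confidence classification with a saturating counter (the 2-bit width and threshold are my assumptions, not G & M's exact parameters): predict only when the counter has saturated past the threshold.

```python
# Hedged sketch of prediction confidence via a saturating counter.

class Confidence:
    def __init__(self, bits=2, threshold=2):
        self.max = (1 << bits) - 1     # counter saturates here
        self.threshold = threshold
        self.counter = 0
    def confident(self):
        # Only use a prediction once the counter clears the threshold.
        return self.counter >= self.threshold
    def update(self, correct):
        if correct:
            self.counter = min(self.counter + 1, self.max)
        else:
            self.counter = max(self.counter - 1, 0)

c = Confidence()
c.update(True)
c.update(True)            # two correct predictions in a row
high = c.confident()      # now confident enough to speculate
```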
  

--------------------------------------------------------------------------------

Computation Re-Use

 S_X: Sodani and Sohi, ISCA 1997
 Block Reuse: Huang and Lilja
 DTM: da Costa, França, and Filho, PACT 2000

 S & S
  S_v: Index table with value.
  S_n: Index table with register number. 
  S_n+d: Index table with register number, chain.

 DTM
  Detect re-use of individual instructions.
  Build re-use traces from re-used instructions.

 Block Reuse
  Compiler identifies block inputs and outputs.  

 Comparison (da Costa et al.)
  Table size: 784 kB (DTM, S_n+d), 1690 kB (Block Reuse)
  Four-way superscalar, dynamically scheduled, 2048-entry bimodal branch predictor, 16-entry ROB.
  Cache 16 kiB insn and data, no L2 cache.
  Note: 
   System has unrealistically fast memory, inflating results.
   Table sizes far exceed the cache size and so 1-cycle access unrealistic.

 Comparison Results
  S_n+d       Speedup: 4.1%
  Block Reuse Speedup: 3.9%
  DTM Trace   Speedup: 10%
  
--------------------------------------------------------------------------------
Dynamic Recompilation

 rePlay: Patel & Lumetta, Micro 2001 

 rePlay
  Frame cache.
  Frame predictor.

 Optimizations
  DCE. (Dead-code elimination.)
  Constant propagation.
  Reassociation. (Reduce height of dataflow tree.)
  CSE. (Common subexpression elimination.)
  Inlining.
  Fetch Scheduling
  Misc: Strength reduction, idioms, etc.
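
Reassociation's height reduction can be illustrated by summing four values as a linear chain versus a balanced tree (the height bookkeeping is my own illustration; both orders give the same sum):

```python
# Hedged sketch of reassociation: same sum, shallower dataflow tree.
# "Height" counts the longest chain of dependent adds.

def chain_sum(vals):
    # ((a + b) + c) + d : each add depends on the previous one.
    total, height = vals[0], 0
    for v in vals[1:]:
        total += v
        height += 1
    return total, height

def tree_sum(vals):
    # (a + b) + (c + d) : pairs add in parallel, log-depth chain.
    height = 0
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)] \
               + ([vals[-1]] if len(vals) % 2 else [])
        height += 1
    return vals[0], height

s1 = chain_sum([1, 2, 3, 4])   # height 3
s2 = tree_sum([1, 2, 3, 4])    # same sum, height 2
```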
  

 Simulations
  SimpleScalar/Alpha
  SPEC 2000
  Eight-way 1024-entry ROB.
  L1 Data 64 kiB 1-cycle, L1 Insn 16 kiB, L2 1 MiB 8-cycle, Memory 50-cycle
  Frame Cache: 48 kiB.
  Path predictor: 8 kiB table.
  Six integer FUs.
  Branch considered biased: same outcome 32 consecutive times.
  Optimization time: 1000 cycles.

 Comparison Systems
  rePlay
  rePlay without optimization
  Conventional, 64 kiB icache, 8 kiB gshare, 4 kiB JTB.
  Trace cache with 16-insn entries, 32 kiB trace cache, 32 kiB icache, 8 kiB gshare.

 Speedups:
  Conv: 21%
  rePlay w/o opt: 13%
  Trace cache: 18%