LSU EE 7700-2 -- Fall 2003 -- Multiprocessors, Etc.
Note Set 7 -- Critical Path Compression
Under Construction, To Do
Finish UC list.
Contents
References
Listed in each section.
Linked to http://www.ece.lsu.edu/tca/ref.html
--------------------------------------------------------------------------------
Critical Path (Dataflow?) Compression Techniques
Goal
Reduce number of instructions on critical path.
Techniques
Predict
Optimize
Remember (Memoize, special case of optimization.)
Running Example
Digit identifies instruction, critical path shown.
1-2-3-4-5-6-7
Value Prediction
Predict source operands.
Example showing correct prediction.
1-2-3\ (Sources for 4 predicted)
4-5-6-7
Example showing incorrect prediction.
1-2-3-4-5-6-7
4-5-6x (Squash)
Promising Because
Many values predictable.
Impractical (So Far) Because
Predictor table sizes much too large.
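As a concrete sketch, prediction of a source operand can be tabulated per PC using the last-value and stride schemes covered later in this set. The class and table organization below are invented for illustration (a real predictor would use a bounded, hashed table):

```python
# Sketch of last-value and stride value prediction, keyed by PC.
# Illustrative only; table sizes and indexing are not from any
# particular design.

class StridePredictor:
    def __init__(self):
        self.last = {}    # PC -> last value produced
        self.stride = {}  # PC -> last observed stride

    def predict(self, pc):
        # Predicted value = last value + last stride.  Last-value
        # prediction is the special case stride == 0.
        if pc not in self.last:
            return None
        return self.last[pc] + self.stride.get(pc, 0)

    def update(self, pc, actual):
        if pc in self.last:
            self.stride[pc] = actual - self.last[pc]
        self.last[pc] = actual

p = StridePredictor()
for v in (100, 104, 108):    # e.g., an address incremented by 4
    p.update(0x400, v)
# Next occurrence of PC 0x400 predicts 108 + 4 = 112.
```

If the prediction turns out wrong, dependent instructions are squashed, as in the incorrect-prediction example above.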
Computation Re-Use
Can re-use an individual instruction or a slice (several insns).
Insn re-use: for a PC, cache source operands ...
and results ...
for re-use.
Example, 3 is a divide instruction:
W/O Reuse:  1-2-3-------4-5
With Reuse: 1-2-r-4-5        (r is the table-lookup time.)
Slice re-use: for a PC, cache live-ins ...
and live-outs...
for re-use.
1-2-3-7 (4-5-6 cached).
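Insn re-use amounts to memoization keyed by PC and source operand values. A minimal sketch, with the reuse buffer modeled as a dictionary (names and structure are illustrative, not from any cited design):

```python
# Sketch of instruction re-use: a reuse buffer maps (PC, source
# operand values) to a previously computed result, so a long-latency
# op (e.g., the divide at position 3 above) completes in one
# table-lookup time on a hit.

reuse_buffer = {}

def execute_div(pc, a, b):
    key = (pc, a, b)
    if key in reuse_buffer:            # hit: skip the divider
        return reuse_buffer[key], True
    result = a // b                    # miss: pay full latency
    reuse_buffer[key] = result
    return result, False

r1, hit1 = execute_div(0x1000, 84, 2)  # miss: computes 42
r2, hit2 = execute_div(0x1000, 84, 2)  # hit: cached 42, no divide
```

Slice re-use is the same idea with live-ins as the key and live-outs as the cached value.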
Promising Because
Over 90% of computations re-used (check).
Impractical Because
Table lookup starts too late to be useful.
Table sizes much too large.
Optimization Cache, Dynamic Recompilation
Prepare optimized version of code at run-time.
Store optimized code.
1-a-b-c-7 (a-b-c equivalent to 2-3-4-5-6)
Promising Because
With branch outcomes and stable reg values ...
can do many optimizations.
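A toy sketch of one such optimization: with a register's value observed to be stable, constant-propagate it and drop the folded instruction, so the stored optimized trace is shorter than the original. The mini-IR and function names here are invented for illustration:

```python
# Sketch of run-time constant propagation plus dead-code removal
# over a tiny invented trace IR: (op, dest, src...) tuples.

def optimize(trace, stable_regs):
    out = []
    consts = dict(stable_regs)        # registers with stable values
    for op, dst, *srcs in trace:
        if op == "add" and all(s in consts for s in srcs):
            consts[dst] = sum(consts[s] for s in srcs)  # fold
            continue                  # folded insn removed from trace
        out.append((op, dst, *srcs))
    return out, consts

trace = [("add", "r3", "r1", "r2"),   # r3 = r1 + r2
         ("store", None, "r3")]
opt, consts = optimize(trace, {"r1": 10, "r2": 32})
# The add folds away; only the store remains, with r3 known to be 42.
```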
Not Yet Implemented Because
Complexity.
Store / Load Renaming
Predict that store/load pair use same address.
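When the pair is predicted to use the same address, the load can take the store's data directly (a register move) before either address resolves, with verification afterward. A minimal sketch under invented names:

```python
# Sketch of store/load renaming: forward the store's data to the
# predicted-matching load speculatively, then verify once both
# addresses are known.  Illustrative only.

def renamed_load(store_addr, store_data, load_addr, memory):
    predicted = store_data             # forwarded speculatively
    if load_addr == store_addr:        # verify: addresses match?
        return predicted, False        # prediction held, no squash
    return memory[load_addr], True     # mispredict: squash, redo load

mem = {0x200: 7}
val, squash = renamed_load(0x100, 42, 0x100, mem)  # pair matches
```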
Master/Slave Speculative Parallelization
One processor runs fast & dangerous code ...
other processors run slow and safe code ...
using and verifying values from the fast & dangerous run.
a-b-c (a provides 4's input, b provides 6's input)
1-2-3 (3 verifies 4's input, 5 verifies 6's input)
4-5
6-7
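The master/slave interaction can be sketched as follows; the fast path, safe path, and the rare case it mishandles are all invented for illustration:

```python
# Sketch of master/slave speculative parallelization: the master runs
# a fast, unsafe version and hands its value to a slave, which starts
# work early; a safe recomputation verifies the prediction and forces
# a squash-and-redo on mismatch.

def master(x):
    return x + 1                       # fast path: ignores a rare case

def safe(x):
    return x + 2 if x >= 100 else x + 1  # slow path handles all cases

def run(x):
    pred = master(x)                   # slave starts early with pred
    speculative = pred * 10            # slave's work using prediction
    if pred != safe(x):                # verification (safe processor)
        return safe(x) * 10            # squash, redo with safe value
    return speculative

common = run(5)      # prediction correct, speculative work kept
rare = run(200)      # master wrong, slave squashes and redoes
```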
Techniques Covered Elsewhere or Omitted
Prefetching.
Load address prediction.
Branch prediction.
--------------------------------------------------------------------------------
Value Prediction
Gabbay & Mendelson ToCS 1998
Techniques
Last value.
Stride.
Correlated (FCM). Like local history. (Not covered in G & M.)
Hybrid. (Not covered in G & M.)
Performance of Last Value, Stride
SPEC Int95, SPARC, Infinite tables.
Prediction Accuracy of Load Values: Stride, 48%; Last-Value, 50%.
Prediction Accuracy of ALU Values: Stride, 61%; Last-Value, 52%.
Performance (speedup, etc) not analyzed.
Value Classification (Prediction Confidence)
Use saturating counter to assign confidence.
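A minimal sketch of such a classifier, using a 2-bit saturating counter per predictor entry; the threshold, maximum, and reset-on-mispredict policy are illustrative choices, not from G & M:

```python
# Sketch of prediction-confidence classification: predict only when
# the entry's saturating counter is at or above a threshold;
# increment on a correct prediction, reset on an incorrect one.

class Confidence:
    def __init__(self, max_count=3, threshold=2):
        self.count = 0
        self.max = max_count
        self.threshold = threshold

    def confident(self):
        return self.count >= self.threshold

    def update(self, correct):
        if correct:
            self.count = min(self.count + 1, self.max)  # saturate
        else:
            self.count = 0            # reset penalizes mispredicts

c = Confidence()
c.update(True); c.update(True)
# After two correct outcomes the entry is classified predictable.
```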
--------------------------------------------------------------------------------
Computation Re-Use
S_X: Sodani and Sohi, ISCA 1997
Block Reuse: Huang and Lilja
DTM: daCosta, Franca, and Filho, PACT 2000
S & S
S_v: Index table with value.
S_n: Index table with register number.
S_n+d: Index table with register number, chain.
DTM
Detect re-use of individual instructions.
Build re-use traces from re-used instructions.
Block Reuse
Compiler identifies block inputs and outputs.
Comparison (daCosta et al)
Table size: 784 kB (DTM, S_n+d), 1690 kB (Block Reuse)
Four-way superscalar, dynamically scheduled; 2048-entry bimodal branch predictor, 16-entry ROB.
Cache 16 kiB insn and data, no L2 cache.
Note:
System has unrealistically fast memory, inflating results.
Table sizes far exceed the cache sizes, so 1-cycle table access is unrealistic.
Comparison Results
S_n+d Speedup: 4.1%
Block Reuse Speedup: 3.9%
DTM Trace Speedup: 10%
--------------------------------------------------------------------------------
Dynamic Recompilation
rePlay: Patel & Lumetta, Micro 2001
rePlay
Frame cache.
Frame predictor.
Optimizations
DCE. (Dead-code elimination.)
Constant propagation.
Reassociation. (Reduce height of dataflow tree.)
CSE. (Common subexpression elimination.)
Inlining.
Fetch Scheduling
Misc: Strength reduction, idioms, etc.
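Reassociation's height reduction can be made concrete: ((a+b)+c)+d has dataflow height 3, while the reassociated (a+b)+(c+d) computes the same sum at height 2, shortening the critical path. The tuple-based expression tree below is invented for illustration:

```python
# Sketch of reassociation as dataflow-height reduction over an
# invented expression tree: ("op", left, right) tuples, strings as
# leaf operands.

def height(tree):
    if not isinstance(tree, tuple):
        return 0                       # a leaf operand
    _, left, right = tree
    return 1 + max(height(left), height(right))

linear = ("+", ("+", ("+", "a", "b"), "c"), "d")    # serial chain
balanced = ("+", ("+", "a", "b"), ("+", "c", "d"))  # reassociated
# height(linear) == 3, height(balanced) == 2: one level shorter.
```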
Simulations
SimpleScalar/Alpha
SPEC 2000
Eight-way 1024-entry ROB.
L1 Data 64 kiB 1-cycle, L1 Insn 16 kiB, L2 1 MiB 8-cycle, Memory 50-cycle
Frame Cache: 48 kiB.
Path predictor: 8 kiB table.
Six integer FUs.
Branch considered biased: same outcome 32 consecutive times.
Optimization time: 1000 cycles.
Comparison Systems
rePlay
rePlay without optimization
Conventional, 64 kiB icache, 8 kiB gshare, 4 kiB JTB.
Trace cache with 16-insn entries, 32 kiB trace cache, 32 kiB icache, 8 kiB gshare.
Speedups:
Conv: 21%
rePlay w/o opt: 13%
Trace cache: 18%