LSU EE 7700-2 -- Fall 2003 -- Multiprocessors, Etc.

Note Set 7 -- Critical Path Compression

Under Construction, To Do
  Finish UC list.

Contents

References
  Listed in each section.  Linked to http://www.ece.lsu.edu/tca/ref.html

--------------------------------------------------------------------------------

Critical Path (Dataflow?) Compression Techniques

Goal
  Reduce the number of instructions on the critical path.

Techniques
  Predict.
  Optimize.
  Remember.  (Memoize, a special case of optimization.)

Running Example
  Digit identifies instruction, critical path shown:
    1-2-3-4-5-6-7

Value Prediction
  Predict source operands.
  Example showing a correct prediction:
    1-2-3\          (Sources for 4 predicted, so 4-5-6-7 start early.)
    4-5-6-7
  Example showing an incorrect prediction:
    1-2-3-4-5-6-7
    4-5-6x          (Squash: prediction wrong, speculative work discarded.)
  Promising Because
    Many values are predictable.
  Impractical (So Far) Because
    Predictor table sizes are much too large.

Computation Re-Use
  Can re-use an individual instruction or a slice (several insns).
  Insn re-use: for a PC, cache the source operands ... and results ... for re-use.
  Example, 3 is a divide instruction:
    W/O Reuse:  1-2-3-------4-5
    With Reuse: 1-2-r-4-5           (r is the time for the table lookup.)
  Slice re-use: for a PC, cache the live-ins ... and live-outs ... for re-use.
    1-2-3-7         (4-5-6 cached.)
  Promising Because
    Over 90% of computations are re-used (check).
  Impractical Because
    Table lookup starts too late to be useful.
    Table sizes are much too large.

Optimization Cache, Dynamic Recompilation
  Prepare an optimized version of the code at run time ...
  ... and store the optimized code.
    1-a-b-c-7       (a-b-c equivalent to 2-3-4-5-6.)
  Promising Because
    With branch outcomes and stable register values ...
    ... many optimizations can be done.
  Not Yet Implemented Because
    Complexity.

Store / Load Renaming
  Predict that a store/load pair uses the same address.

Master/Slave Speculative Parallelization
  One processor runs fast & dangerous code ...
  ... other processors run slow and safe code ...
  ... using and verifying values from fast & dangerous.
    a-b-c           (a provides 4's input, b provides 6's input.)
    1-2-3           (3 verifies 4's input, 5 verifies 6's input.)
     4-5
      6-7

Techniques Covered Elsewhere or Omitted
  Prefetching.
  Load address prediction.
  Branch prediction.

--------------------------------------------------------------------------------

Value Prediction

  Gabbay & Mendelson, ToCS 1998.

Techniques
  Last value.
  Stride.
  Correlated (FCM).  Like local history.  (Not covered in G & M.)
  Hybrid.  (Not covered in G & M.)

Performance of Last Value, Stride
  SPEC Int95, SPARC, infinite tables.
  Prediction accuracy of load values: stride, 48%; last-value, 50%.
  Prediction accuracy of ALU values: stride, 61%; last-value, 52%.
  Performance (speedup, etc.) not analyzed.

Value Classification (Prediction Confidence)
  Use a saturating counter to assign confidence.

--------------------------------------------------------------------------------

Computation Re-Use

  S_X: Sodani and Sohi, ISCA 1997.
  Block Reuse: Huang and Lilja.
  DTM: daCosta, Franca, and Filho, PACT 2000.

S & S
  S_v: Index table with value.
  S_n: Index table with register number.
  S_n+d: Index table with register number, chain.

DTM
  Detect re-use of individual instructions.
  Build re-use traces from re-used instructions.

Block Reuse
  Compiler identifies block inputs and outputs.

Comparison (daCosta et al)
  Table size: 784 kB (DTM, S_n+d), 1690 kB (Block Reuse).
  Four-way superscalar, dynamically scheduled; 2048-entry bimodal branch
    predictor; 16-entry ROB.
  Cache: 16 kiB insn and data, no L2 cache.
  Note: The system has unrealistically fast memory, inflating results.
    The table sizes far exceed the cache size, so 1-cycle access is
    unrealistic.

Comparison Results
  S_n+d speedup: 4.1%.
  Block Reuse speedup: 3.9%.
  DTM trace speedup: 10%.

--------------------------------------------------------------------------------

Dynamic Recompilation

  rePlay: Patel & Lumetta, Micro 2001.

rePlay
  Frame cache.
  Frame predictor.

Optimizations
  DCE.  (Dead-code elimination.)
  Constant propagation.
  Reassociation.  (Reduce height of dataflow tree.)
  CSE.  (Common subexpression elimination.)
  Inlining.
  Fetch scheduling.
  Misc: strength reduction, idioms, etc.

Simulations
  SimpleScalar/Alpha, SPEC 2000.
  Eight-way, 1024-entry ROB.
  L1 data 64 kiB, 1-cycle; L1 insn 16 kiB; L2 1 MiB, 8-cycle;
    memory, 50-cycle.
  Frame cache: 48 kiB.  Path predictor: 8 kiB table.
  Six integer FUs.
  Branch considered biased: same outcome 32 consecutive times.
  Optimization time: 1000 cycles.

Comparison Systems
  rePlay.
  rePlay without optimization.
  Conventional: 64 kiB icache, 8 kiB gshare, 4 kiB JTB.
  Trace cache: 16-insn entries, 32 kiB trace cache, 32 kiB icache,
    8 kiB gshare.

Speedups
  Conv: 21%.
  rePlay w/o opt: 13%.
  Trace cache: 18%.
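The stride predictor and saturating-counter confidence (value classification) described in the value-prediction section can be sketched in software. This is a minimal illustrative model, not any paper's implementation; the table layout and 2-bit threshold are assumptions for the example.

```python
# Sketch of a per-PC stride value predictor with a 2-bit saturating
# confidence counter (value classification).  A dict stands in for a
# fixed-size, tagged hardware table.
class StridePredictor:
    def __init__(self):
        self.table = {}  # pc -> [last value, stride, confidence 0..3]

    def predict(self, pc):
        """Return (predicted value, confident?) before execution."""
        e = self.table.get(pc)
        if e is None:
            return None, False
        last, stride, conf = e
        return last + stride, conf >= 2   # use prediction only if confident

    def update(self, pc, actual):
        """Train with the actual result after execution."""
        e = self.table.get(pc)
        if e is None:
            self.table[pc] = [actual, 0, 0]   # first encounter: allocate
            return
        last, stride, conf = e
        if last + stride == actual:
            e[2] = min(conf + 1, 3)           # correct: saturate up
        else:
            e[1] = actual - last              # wrong: learn new stride
            e[2] = max(conf - 1, 0)           # and saturate down
        e[0] = actual

p = StridePredictor()
for v in [10, 14, 18, 22, 26]:   # strided values, e.g. array addresses
    pred, confident = p.predict(0x400)
    p.update(0x400, v)
print(p.predict(0x400))          # (30, True)
```

A last-value predictor is the special case with the stride held at zero; a hybrid scheme would select between the two per PC.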
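The insn re-use idea (cache source operands and results per PC, as in the S_v scheme) amounts to hardware memoization. A rough software analogy, with a dict as the reuse buffer and integer division standing in for the long-latency divide instruction "3" from the running example:

```python
# Sketch of an S_v-style reuse buffer: indexed by PC, tagged with the
# source-operand values; on a hit the cached result replaces execution.
class ReuseBuffer:
    def __init__(self):
        self.table = {}   # pc -> (operand values, result)

    def lookup(self, pc, operands):
        e = self.table.get(pc)
        if e is not None and e[0] == operands:
            return e[1]               # hit: re-use, skip execution
        return None

    def fill(self, pc, operands, result):
        self.table[pc] = (operands, result)

buf = ReuseBuffer()
executions = 0                        # count actual divider uses

def divide(pc, a, b):
    global executions
    r = buf.lookup(pc, (a, b))
    if r is None:                     # miss: execute, then fill table
        executions += 1
        r = a // b
        buf.fill(pc, (a, b), r)
    return r

divide(0x100, 84, 2)                  # executes the divide
divide(0x100, 84, 2)                  # same PC and operands: re-used
print(executions)                     # 1
```

Slice re-use is the same idea with a multi-instruction region: tag on the live-ins, cache the live-outs.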
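Two of the frame optimizations listed above, constant propagation and DCE, can be shown on a toy three-address IR. The IR format and pass structure below are invented for illustration; a frame optimizer like rePlay's works on machine code with branch outcomes already fixed.

```python
# Toy constant propagation + dead-code elimination over tuples
# (dst, op, src_a, src_b); ints are constants, strings are registers.
def constant_propagate(code):
    consts, out = {}, []
    for dst, op, a, b in code:
        a = consts.get(a, a)          # substitute known constants
        b = consts.get(b, b)
        if op == 'mov' and isinstance(a, int):
            consts[dst] = a           # dst now holds a known constant
            out.append((dst, 'mov', a, None))
        elif op == 'add' and isinstance(a, int) and isinstance(b, int):
            consts[dst] = a + b       # fold to a constant
            out.append((dst, 'mov', a + b, None))
        else:
            consts.pop(dst, None)     # dst no longer a known constant
            out.append((dst, op, a, b))
    return out

def dead_code_eliminate(code, live_out):
    live, out = set(live_out), []
    for dst, op, a, b in reversed(code):
        if dst in live:               # result needed: keep instruction
            live.discard(dst)
            live |= {x for x in (a, b) if isinstance(x, str)}
            out.append((dst, op, a, b))
    return out[::-1]

frame = [('r1', 'mov', 2, None),
         ('r2', 'mov', 3, None),
         ('r3', 'add', 'r1', 'r2'),   # folds to mov r3, 5
         ('r4', 'add', 'r3', 'r3'),   # folds to mov r4, 10
         ('r5', 'add', 'r1', 'r1')]   # dead: r5 not live out
opt = dead_code_eliminate(constant_propagate(frame), live_out={'r4'})
print(opt)                            # [('r4', 'mov', 10, None)]
```

Five instructions collapse to one, which is exactly the 2-3-4-5-6 to a-b-c compression in the optimization-cache example.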