LSU EE 7700-2 -- Fall 2003 -- Multiprocessors, Etc.

Note Set 6 -- Control Independence Exploitation

Under Construction, To Do
  Update the under-construction list.

Contents
  Introduction
  Rotenberg WR-FD
  Skipper
  Multiscalar/IMT
  Trace Processor (Not covered)
  Conclusions
  Goals (For Studying)

References
  Listed in each section.
  Linked to http://www.ece.lsu.edu/tca/ref.html

Glossary
  DS: Dynamically Scheduled. (a.k.a. out-of-order)
  JTB: (a.k.a. BTB) A PC-indexed table listing predicted targets for
    indirect jumps.
  LSQ: Load/Store Queue.
  RAS: Return Address Stack. Used to predict returns; assumes standard
    call/return instruction usage.
  ROB: Reorder Buffer. (a.k.a. active list)
  Register Map: Table listing the physical register assigned to each
    architected register.
  Store/Load Disambiguation: Determining whether a load depends on an
    earlier store.
  SS: Superscalar.

--------------------------------------------------------------------------------

Introduction

Under Construction (20% done)

Reference
  Rotenberg, HPCA 99

Status
  Experimental; implementations are too complex and costly.

Exploitation Features (Rotenberg Terminology)
  Out-of-order fetch (WR) or selective squash (nWR).
  Speculate (false) dependencies (nFD) or wait for branch resolution (FD).

Other Terminology
  S-Thread: Thread running code that was skipped (so it runs later than
    usual).

Potential Performance Improvement
  Depends on the fetch rate and on where the critical path goes.
  Cases
    Execution following the branch is fetch-limited. (nWR better.)
    Critical path passes through the branch target or fall-through.
      (WR better.)
  Example (Blackboard)

Potential Speedup From Limit-Study Simulation
  About five times. Wow! Maybe.
  Assumptions
    Perfect fetch (16 insn/cycle, any number of branches).
    All loads have 1-cycle access.
    CTIs: 2^16-entry gshare, 2^16-entry JTB, RAS.
      Maybe a 4-5-cycle branch penalty.
    Unlimited execution resources.
    Perfect store/load address disambiguation.
    Perfect knowledge of reconvergence point and skipped code.

Limit Study Conclusions
  (See Figure 3 graphs.)
  Oracle v. nWR-nFD:
    gcc, for a 2 ki-entry ROB, about 2 IPC difference (large).
    nWR-nFD not as good as oracle because:
      The critical path passes through the skip thread.
      Reconvergence points cannot be found for some branches.
  nWR-nFD v. WR-nFD:
    gcc, for a 2 ki-entry ROB, almost 4 IPC difference (large).
    WR not as good as nWR because (a guess):
      Long skip thread and long resolution time.

Register Re-Mapping Headaches
  T-Map: ID Register Map just after the trigger.
  R-Map: ID Register Map just before reconvergence.
  nWR Headache: Must predict the R-Map.
  WR Headache: Must predict the R-Map or re-rename.
  These headaches make CI exploitation impractical (so far).

Control Independence Exploitation Techniques
  Described in detail below.
  Rotenberg WR-FD
  Skipper
  Multiscalar/IMT
  Trace Processor (Not covered)

--------------------------------------------------------------------------------

Rotenberg WR-FD

Rotenberg, Jacobson, Smith; HPCA 1999
  Published with the control-independence limit study.

Key Details
  Fetch is along the predicted path (WR).
  On a misprediction, squash to the reconvergence point (if reached).
  Ignore unknown dependencies.
  Reconvergence point detected by the compiler or pre-processing software.

Recovery
  Re-map source registers; re-issue if necessary.
  Re-issue loads.
  Re-mapping occurs at the same rate as the original mapping.

Reconvergence
  On a misprediction, scan the reorder buffer for the reconvergence point.

Simulation
  16-way superscalar DS; ROB 128 to 512 entries.
  Front End: Ideal fetch; 64 ki-entry gshare, JTB, RAS.
  Data Cache: L1 64 kiB, 4-way, 2-cycle; L2 perfect, 14-cycle hit.
  Store/Load Disambiguation: Execute load, re-execute on error.
  Systems
    Base (ordinary system).
    CI
    CI-I: Zero-cycle recovery.
  Benchmarks: selected SPEC95 (mostly integer).

Simulation Results
  Improvement from CI about 10%.
  CI-I is not much faster.
  Over 50% of mispredictions reconverge.
  About 50 instructions after the reconvergence point (at resolve?).

Simulation Limitations
  Impact of Branch Resolution Time
    Important because it determines how many CI instructions are
      available.
    Overstated: Fetch is unrealistically fast: 16-way with perfect
      instruction supply.
    Understated: Pipeline depth is short compared to current processors.
    The two may cancel each other.
  Impact of L2 Cache Misses
    Not modeled, since the cache is perfect.

--------------------------------------------------------------------------------

Skipper

Cher and Vijaykumar, Micro 2001
  A CI exploitation technique.

Key Details
  Completely implemented in hardware.
  At some branches, fetch skips to the reconvergence point.
  Instructions with unknown dependencies are not executed.
  A confidence estimator chooses which branches to skip.

Reconvergence Point Detection
  Troublesome branches found by the confidence estimator.
  Reconvergence point guessed based on branch direction:
    Forward: target of the jump just before the branch target.
    Backward: the fall-through instruction.

Definitions
  S-Thread: (class) thread running the skipped instructions.
  SCIT: Skipped Computation Information Table
    Table storing information about skippable instructions.
    Indexed By: PC of the troublesome branch.
    Fields:
      Reconvergence PC.
      Maximum number of skipped instructions.
      S-thread input registers. (Main-thread registers read by the
        s-thread.)
      S-thread output registers. (S-thread registers read by the main
        thread.)
    All information in the SCIT is approximate ...
      because paths may be taken differently when the entry is used.

Skipping (Overview)
  Branch Encountered, Maybe Skip
    Confidence estimator assigns low confidence to the branch prediction.
    Retrieve the entry from the SCIT using the PC of the branch.
    Copy the current register map for the S-thread.
    Update the main-thread register map ...
      using the predicted S-thread output registers ...
      and pre-assigned physical registers.
    Create gaps in the ROB and LSQ for the S-thread.
    In the main thread, jump to the reconvergence point ...
      using the updated register map.
    Continue the main thread until the branch resolves.
      The main thread uses ROB and LSQ entries after the respective gaps.
  Branch Resolves:
    Start the S-thread, using the ROB and LSQ entries in the respective
      gaps.
    Execute the S-thread until the reconvergence PC is reached.
  Reconvergence PC Reached.
    Inject pmove instructions: pmove PD, RS
      RS: an architected register, renamed using the S-thread map (at
        the end of the S-thread).
      PD: the physical register preassigned to RS before the S-thread
        started.
    Switch back to the main thread; skipping is complete.

Simulated Configuration
  Eight-way superscalar DS; 128- or 256-entry ROB; 43-entry LSQ.
  Front End: Predict and fetch two branches per cycle (256-entry ROB
    version); nine-cycle mispredict penalty.
  Cache: L1 64 kiB, 2-cycle; L2 2 MiB, 12-cycle; memory 100-cycle.
  SCIT: 128 entries. Maximum gap: 64 insn.
  Confidence Estimator: JRS, 4 bits.

Simulation Results
  Roughly 8% speedup.
  Predictions of the reconvergence point are accurate.
  About 25% of mispredicted branches skipped -- on the low side.
    Reason given: gap length (reconvergence too far away).

--------------------------------------------------------------------------------

Multiscalar/IMT

Multiscalar: Sohi, Breach, and Vijaykumar, ISCA 1995
IMT: Park, Falsafi, and Vijaykumar, ISCA 2003

Multiscalar Goals
  Faster clock frequency by dividing the processor into small clusters ...
    with each cluster executing a contiguous portion of the program
    (a task).
  Improved performance by exploiting control independence.

IMT Goals
  Exploit control independence.
  Unlike Multiscalar, uses one large processor core.

Common Ideas
  Compiler divides the program into tasks.
  Hardware executes several tasks in parallel ...
    Multiscalar: one task per processing unit.
    IMT: one task per thread.
  Fetch and decode switch between tasks.

Tasks
  Small parts of the program; can be several basic blocks.
  Each task has a single entry point but can have multiple exits.
  Identified by the compiler.
  A Task Descriptor is determined for each task, containing:
    Create Mask: List of live-out (later-used) registers.
    Use Mask: List of registers read by the task that were written by
      instructions before the task.
    Information about task exits. (Used to predict the next task.)
  Instructions carry the following extra information:
    Stop "Bit": Indicates that the instruction is the last in the task.
    Forward Bit: Indicates that the destination register is a live-out.
    Added instruction, release: Sends a register to the next task.
      Used when the forward bit wouldn't work.

Task Creation (Compiler Pass)
  Done after the compiler produces ordinary assembler code.
  Decide on a maximum task size (instructions).
  Decide on a maximum number of task exits.
  Start with the Control-Flow Graph (CFG):
    Nodes are basic blocks.
    Arcs are paths.
  Combine nodes so that:
    The maximum numbers of insn and exits are not exceeded.
    Favor blocks with fewer external dependencies.
    Favor blocks containing reconvergence points.

Multiscalar Processing Unit (PU)
  A Multiscalar processor has several PUs.
  Small processor; has its own register file.
  One task per PU.
  Live-out register values are sent between PUs ...
    there is no shared register file.
  A special structure, the ARB, coordinates loads and stores.

Multiscalar Sequencer
  Tasks predicted using a local-history predictor.
  Other predictors used in later Multiscalar papers.

Multiscalar Simulations
  Comparison: 1- and 2-way statically and dynamically scheduled
    superscalar.
  4 (4x1)-, 8 (8x1, 4x2)-, and 16 (8x2)-way Multiscalar.
  Cache: 64 kiB, 1-cycle latency; memory 13-cycle latency.
  Benchmarks: SPEC92 integer and FP, and other integer.

Multiscalar Results
  2-way DS SS v. 8x2-way Multiscalar:
    For the integer SPEC92 benchmarks, speedup 1.43.
      (/ (+ 1.33 2.64 1.24 0.95 1.5 0.9) 6) = 1.4267
    Small utilities and a synthetic benchmark have higher speedups.
  About 85% task prediction accuracy.
  Comments:
    The Multiscalar system has eight times the decode and issue width
      (and cost).
    Hard to draw conclusions without simulations of 8- and 16-way SS
      systems.
    Cache miss latency is too low; speedups might have been lower.

Multiscalar Shortcomings
  Requires an evenly distributed critical path ...
    execution slows where the critical path clumps.
    This is because resources are partitioned into PUs.

IMT - Implicit Multithreading
  Start with code compiled for Multiscalar.
  Run it on something like an SMT (simultaneous multithreading)
    processor.
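The create and use masks in the task descriptors above can be found with a single pass over a task's instructions. Below is a minimal C sketch of that pass; the `Insn` encoding and register numbering are invented for illustration, and a real compiler pass would also intersect the create mask with liveness information to keep only the later-used (live-out) registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical instruction: reads src1 and src2, writes dst.
   A negative register number means "unused". */
typedef struct { int dst, src1, src2; } Insn;

/* Compute a task's use mask (registers read before being written
   inside the task, so their values must come from earlier tasks) and
   create mask (registers written by the task, the candidates for
   live-outs). One bit per architected register. */
void task_masks(const Insn *task, int n,
                uint32_t *use_mask, uint32_t *create_mask)
{
    uint32_t written = 0;  /* registers written so far within the task */
    *use_mask = *create_mask = 0;
    for (int i = 0; i < n; i++) {
        int srcs[2] = { task[i].src1, task[i].src2 };
        for (int s = 0; s < 2; s++)
            if (srcs[s] >= 0 && !(written & (1u << srcs[s])))
                *use_mask |= 1u << srcs[s];  /* read before any local write */
        if (task[i].dst >= 0) {
            written      |= 1u << task[i].dst;
            *create_mask |= 1u << task[i].dst;
        }
    }
}
```

For the two-instruction task "r1 = r2 + r3; r2 = r1 + r4" the pass yields a use mask of {r2, r3, r4} (r1 is produced locally before its read) and a create mask of {r1, r2}.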
Key Differences Between IMT and Multiscalar
  Functional units are shared between tasks.
  Fetch is based on predicted dependence ...
    and resource usage.

Non-Speculative Task
  The oldest in-flight task is non-speculative.

DRP - Dynamic Resource Predictor
  Predicts the ROB slots and physical registers needed.
  Used to decide whether to fetch a speculative task.

ITDH - Inter-Thread (task) Dependence Heuristic
  Three consecutive identical tasks:
    Assumed to be loop iterations.
    Predicted to be independent.
  Anything else: predicted dependent.

IMT Fetch
  If tasks are dependent, or if there are not enough resources ...
    fetch only the non-speculative (oldest) task.
  Otherwise fetch the task with the fewest in-flight instructions.

IMT Simulations
  All: Eight-way dynamically scheduled.
  Base: 1024-entry ROB.
  All: Two predictions per cycle.
  IMT: 64-entry DRP, 3-entry ITDH.
  Benchmarks: SPEC2000 integer and FP.

IMT Results
  Integer: 20% speedup; FP: 30% speedup.

--------------------------------------------------------------------------------

Conclusions

Potential
  Large, based on the limit study.

Rotenberg WR-nFD
  Details of finding the reconvergence point not worked out.

Skipper
  Small speedup, moderate-to-high hardware complexity.
  Probably won't be implemented. (2003)

IMT
  Good speedup.
  Moderate-to-high hardware complexity.
  Requires special compilation or post-processing.

--------------------------------------------------------------------------------

Study Goals

Control Independence Concept
  Know what it is and why one might want to exploit it.
  Understand how the critical path determines how well it can be
    exploited.
  Understand how dependencies complicate things.

For Each Technique
  Show a favorable or unfavorable code example ...
    and how it executes.
  Describe the special tables used, and for each answer:
    How is it indexed? (PC of decoding instruction, etc.)
    What information is in each entry, and what is it used for?
    When is the table read?
    When is the table updated?
  Show other special hardware used.
  Show how the hardware exploits control independence.
  Explain the technique's shortcomings.
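As a starting point for the favorable-example goal above, here is a sketch of code that favors CI exploitation. The function and data are invented for illustration; the comments mark the roles the techniques above assign to each statement (trigger, skippable paths, reconvergence point, s-thread output).

```c
#include <assert.h>

/* Favorable case for control-independence exploitation: the branch on
   (a[i] & 1) is data-dependent and hard to predict, both paths are
   short, and they reconverge immediately at the "sum += adj" line.
   Everything from the reconvergence point on is control independent of
   the branch; only the register holding adj (the s-thread output, in
   Skipper terms) carries a data dependence across the reconvergence
   point. */
int favorable(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int adj;
        if (a[i] & 1)          /* trigger: low-confidence branch    */
            adj = a[i] * 3;    /* taken path (skippable)            */
        else
            adj = a[i] / 2;    /* fall-through path (skippable)     */
        sum += adj;            /* reconvergence point               */
        sum ^= i;              /* control independent of the branch */
    }
    return sum;
}
```

An unfavorable example would differ in one of two ways: the paths are long (reconvergence beyond the gap or ROB scan range), or nearly every later instruction reads a register written on the skipped paths, so little control-independent work remains.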