LSU EE 7700-2 -- Fall 2003 -- Multiprocessors, Etc.
Note Set 6 -- Control Independence Exploitation
Under Construction, To Do
Update under construction list.
Contents
Introduction
Rotenberg WR-FD
Skipper
Multiscalar/IMT
Trace Processor (Not covered)
Conclusions
Goals (For Studying)
References
Listed in each section.
Linked to http://www.ece.lsu.edu/tca/ref.html
Glossary
DS: Dynamically Scheduled. (a.k.a., out-of-order)
JTB: (a.k.a. BTB) A PC-indexed table listing predicted targets for indirect jumps.
LSQ: Load/Store Queue.
RAS: Return address stack. Used to predict returns, assumes standard call/return insn usage.
ROB: Reorder Buffer (a.k.a., active list)
Register Map: Table listing physical register assigned to each architected register.
Store/Load Disambiguation: Determining whether a load reads data written by an earlier in-flight store (i.e., whether their addresses match).
SS: Superscalar.
--------------------------------------------------------------------------------
Introduction
Under Construction (20% done)
Reference
Rotenberg, HPCA 99
Status
Experimental, implementations too complex and costly.
Exploitation Features (Rotenberg Terminology)
Out of order fetch (WR) or selective squash (nWR).
Speculate (false) dependencies (nFD) or wait for branch resolution (FD).
Other Terminology
S-Thread: Thread running code that was skipped (so it is run later than usual).
Potential Performance Improvement
Depends on: fetch rate and location of critical path.
Cases
Execution following branch is fetch-limited. (nWR better.)
Critical path at branch target or fall-through. (WR better.)
Example (Blackboard)
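A minimal sketch of the idea (hypothetical code, Python standing in for assembly; not the blackboard example). The statements after the if/else sit at the reconvergence point and execute on either path, so a CI processor need not squash them on a misprediction -- only results data-dependent on the skipped code must wait.

```python
def kernel(a, b, taken):
    if taken:          # hard-to-predict branch
        t = a + 1      # taken path
    else:
        t = a - 1      # fall-through path (the "skipped" code)
    r = b * 2          # reconvergence point: control INdependent
    s = t + r          # control independent but DATA dependent on t
    return r, s

# r is identical for both branch outcomes; only s needs the resolved t.
```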
Potential Speedup From Limit-Study Simulation
About five times. Wow! Maybe.
Assumptions
Perfect fetch (16 insn/cycle, any number of branches).
All loads have 1-cycle access.
CTIs: 2^16 entry gshare, 2^16 entry JTB, RAS.
Maybe 4-5 branch penalty.
Unlimited execution resources.
Perfect store/load address disambiguation.
Perfect knowledge of reconvergence point and skipped code.
Limit Study Conclusions
(See Figure 3 graphs.)
Oracle v. nWR-nFD: gcc, for 2ki ROB about 2 IPC difference (large).
nWR-nFD not as good as oracle because:
Critical path passes through skip thread.
Reconvergence points cannot be found for some branches.
nWR-nFD v. WR-nFD: gcc, for 2ki ROB, almost 4 IPC difference (large).
WR not as good as nWR because:
Guess: Long skip thread and long resolution time.
Register Re-Mapping Headaches
T-Map: ID Register Map just after Trigger.
R-Map: ID Register Map just before Reconvergence.
nWR Headache: Have to predict R-Map.
WR Headache: Have to predict R-Map or re-rename.
These headaches make CI exploitation impractical (so far).
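Toy illustration of the headache (assumed register names and physical-register assignments, not from the paper): any skipped instruction that writes a register changes the map, so the R-Map differs from the T-Map in a way that is unknown until the skipped code is renamed.

```python
# T-Map: ID register map just after the trigger branch.
t_map = {"r1": "p3", "r2": "p7"}

# Skipped code writes r1, so renaming it allocates a new physical reg.
r_map = dict(t_map)
r_map["r1"] = "p9"    # skipped insn "r1 <- ..." renamed to p9

# A CI instruction reading r1 must use p9 -- but p9 is not known when
# the CI code is renamed first.  Hence: predict the R-Map (nWR) or
# re-rename after the fact (WR).
```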
Control Independence Exploitation Techniques
Described in detail below.
Rotenberg WR-FD
Skipper
Multiscalar/IMT
Trace Processor (Not covered)
--------------------------------------------------------------------------------
Rotenberg WR-FD
Rotenberg, Jacobson, Smith; HPCA 1999
Published with control independence limit study.
Key Details
Fetch is along predicted path (WR).
On misprediction squash to reconvergence point (if reached).
Ignore unknown dependencies.
Reconvergence point detected by compiler or pre-processing software.
Recovery
Re-map source registers, re-issue if necessary.
Re-issue loads.
Re-mapping occurs at same rate as original mapping.
Reconvergence
On misprediction scan reorder buffer for reconvergence point.
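A sketch of that scan, assuming each ROB entry carries its PC (toy (pc, op) tuples; not the paper's exact structure). Entries between the branch and the reconvergence point are squashed; entries at and after it are kept and re-mapped.

```python
def find_reconvergence(rob, branch_idx, recon_pc):
    # Scan entries younger than the mispredicted branch for the
    # (compiler-supplied) reconvergence PC.
    for i in range(branch_idx + 1, len(rob)):
        if rob[i][0] == recon_pc:
            return i          # first control-independent entry
    return None               # reconvergence point not yet fetched
```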
Simulation
16-way superscalar DS, ROB 128 to 512 entries.
Front End: Ideal fetch, 64 ki entry gshare, JTB, RAS.
Data Cache: L1 64 kiB, 4-way, 2 cycle; L2: perfect, 14 cycle hit.
Store/Load Disambiguation: Execute load, re-execute on error.
Systems
Base (Ordinary system).
CI
CI-I: Zero-cycle recovery.
Benchmarks: selected SPEC95 (mostly integer).
Simulation Results
Improvement of CI about 10%.
CI-I is not much faster.
Over 50% of mispredictions reconverge.
About 50 instructions after reconvergence point (at resolve?).
Simulation Limitations
Impact of Branch Resolution Time
Important because that determines how many CI instructions are available.
Overstated:
Fetch unrealistically fast: 16-way and perfect instruction supply.
Understated:
Pipeline depth short compared to current processors.
The two may cancel each other.
Impact of L2 Cache Misses
Not modeled since cache is perfect.
--------------------------------------------------------------------------------
Skipper
Cher and Vijaykumar, MICRO 2001
Skipper
CI exploitation technique.
Key Details
Completely implemented in hardware.
At some branches fetch skips to reconvergence point.
Instructions with unknown dependencies not executed.
Use confidence estimator to choose branch.
Reconvergence Point Detection
Troublesome branches found by confidence estimator.
Reconvergence point guessed based on branch direction:
Forward: target of jump just before branch target.
Backward: fall-through instruction.
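A sketch of that guess, assuming 4-byte instructions and a toy insn table mapping PC to ("jmp", target) for unconditional jumps (the plain if/then fallback is my assumption, not stated in the notes):

```python
def guess_reconvergence_pc(branch_pc, target_pc, insns):
    if target_pc <= branch_pc:         # backward branch: a loop,
        return branch_pc + 4           # reconverge at fall-through
    jump = insns.get(target_pc - 4)    # insn just before branch target:
    if jump is not None and jump[0] == "jmp":
        return jump[1]                 # end-of-then jmp -> past the else
    return target_pc                   # assumed: plain if/then reconverges
                                       # at the target itself
```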
Definitions
S-Thread: (class) thread running skipped instructions.
SCIT: Skipped Computation Information Table
Table storing information about skippable instructions.
Indexed By: PC of troublesome branch.
Fields:
Reconvergence PC.
Maximum number of skipped instructions.
S-thread input registers. (Main thread registers read by s-thread.)
S-thread output registers. (s-thread registers read by main thread.)
All information in SCIT is approximate ...
because paths may be taken differently when entry used.
Skipping (Overview)
Branch Encountered, Maybe Skip
Confidence estimator assigns low confidence to branch prediction.
Retrieve entry from SCIT using PC of branch.
Copy current register map for S-thread.
Update main-thread register map ...
using predicted S-thread output registers ...
and pre-assigned physical registers.
Create gap in ROB and LSQ for S-thread.
In main thread jump to reconvergence point ...
using updated register map.
Continue main thread until branch resolves.
Main thread uses ROB and LSQ entries after the respective gaps.
Branch Resolves:
Start S-thread, use ROB and LSQ entries in the respective gaps.
Execute S-thread until reconvergence PC reached.
Reconvergence PC Reached.
Inject pmove instructions: pmove PD, RS
RS: an architected register, renamed using S-thread map (at end of S-thread).
PD: the physical register preassigned to RS before S-thread started.
Switch back to main thread, skipping complete.
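The map manipulation above can be sketched as follows (assumed data structures: maps as dicts, a free list of physical registers, prf as the physical register file; not Skipper's exact hardware):

```python
def start_skip(reg_map, predicted_outs, free_pregs):
    # Copy the current map for the S-thread; give the main thread a
    # fresh pre-assigned physical reg for each predicted S-thread output.
    s_map = dict(reg_map)
    preassigned = {}
    for r in predicted_outs:
        p = free_pregs.pop()
        preassigned[r] = p
        reg_map[r] = p          # main thread reads r from p
    return s_map, preassigned

def inject_pmoves(preassigned, s_map, prf):
    # pmove PD, RS: copy RS (renamed with the final S-thread map) into
    # the physical register PD pre-assigned before the skip started.
    for rs, pd in preassigned.items():
        prf[pd] = prf[s_map[rs]]
```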
Simulated Configuration
Eight-way superscalar DS; 128-, 256-entry ROB; 43-entry LSQ.
Front End: Predict and fetch two branches per cycle (256 ROB version).
Nine-cycle mispredict penalty.
Cache: L1 64 kiB, 2-cycle; L2 2-MiB 12-cycle, Memory 100-cycle.
SCIT: 128 entries.
Maximum gap: 64 insn.
Confidence Estimator: JRS, 4-bit counters.
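A sketch of a JRS (resetting-counter) confidence estimator; table size and threshold here are assumptions, not Skipper's exact values:

```python
class JRSEstimator:
    # A correct prediction increments a saturating counter; a
    # misprediction resets it.  A low counter value marks the branch
    # as low-confidence -- a candidate for skipping.
    def __init__(self, entries=1024, bits=4, threshold=15):
        self.ctr = [0] * entries
        self.max = (1 << bits) - 1
        self.threshold = threshold
        self.entries = entries
    def confident(self, pc):
        return self.ctr[pc % self.entries] >= self.threshold
    def update(self, pc, prediction_correct):
        i = pc % self.entries
        if prediction_correct:
            self.ctr[i] = min(self.ctr[i] + 1, self.max)
        else:
            self.ctr[i] = 0     # resetting counter
```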
Simulation Results
Roughly 8% speedup.
Predictions of reconvergence point accurate.
About 25% of mispredicted branches skipped -- on the low side.
Reasons given: gap length (reconvergence too far away).
--------------------------------------------------------------------------------
Multiscalar/IMT
Multiscalar: Sohi, Breach, and Vijaykumar, ISCA 1995
IMT: Park, Falsafi, and Vijaykumar, ISCA 2003
Multiscalar Goals
Faster clock frequency by dividing processor into small clusters ...
with each cluster executing a contiguous portion of program (task).
Improved performance by exploiting control independence.
IMT Goals
Exploit control independence.
Unlike Multiscalar, uses one large processor core.
Common Ideas
Compiler divides program into tasks.
Hardware executes several tasks in parallel ...
Multiscalar: one task per processing unit.
IMT: one task per thread.
Fetch and decode switches between tasks.
Tasks
Small parts of program, can be several basic blocks.
Each task has a single entry point but can have multiple exits.
Identified by compiler.
Task Descriptor determined for each task containing:
Create Mask: List of live-out (later-used) registers.
Use Mask: List of registers read by task that were written by insn before task.
Information about task exits. (Used to predict next task.)
Instructions have the following extra information:
Stop "Bit": Indicator that instruction is last in task.
Forward Bit: Indicator that dest. register is a live out.
Added Instruction
Release: Sends register to next task.
Used when forward bit wouldn't work.
Task Creation (Compiler Pass)
Done after compiler produces ordinary assembler code.
Decide on maximum task size (instructions).
Decide on maximum task exits.
Start with Control-Flow Graph (CFG)
Nodes are basic blocks.
Arcs are control-flow edges (branch targets and fall-throughs).
Combine nodes so that:
Maximum number of insn and exits not exceeded.
Favor blocks with fewer external dependencies.
Favor blocks containing reconvergence points.
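A toy greedy version of this pass (my simplification, not Multiscalar's actual algorithm: it walks blocks in layout order, ignores the dependency and reconvergence heuristics, and over-counts exits that merging would internalize):

```python
def form_tasks(blocks, max_insns, max_exits):
    # blocks: list of (n_insns, n_exits) per basic block, layout order.
    # Returns a list of tasks, each a list of block indices.
    tasks, cur, cur_i, cur_e = [], [], 0, 0
    for b, (n, e) in enumerate(blocks):
        if cur and (cur_i + n > max_insns or cur_e + e > max_exits):
            tasks.append(cur)          # limits exceeded: close task
            cur, cur_i, cur_e = [], 0, 0
        cur.append(b)
        cur_i += n
        cur_e += e
    if cur:
        tasks.append(cur)
    return tasks
```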
Multiscalar Processing Unit (PU)
Multiscalar processor has several PUs.
Small processor, has own register file.
One task per PU.
Live-out register values sent between PUs ...
there is no shared register file.
Special structure, ARB, for coordinating loads and stores.
Multiscalar Sequencer
Tasks predicted using local history predictor.
Other predictors used in later Multiscalar papers.
Multiscalar Simulations
Comparison
1- and 2-way static and dynamically scheduled superscalar.
4 (4x1)-, 8 (8x1,4x2)- and 16 (8x2)-way Multiscalar.
Cache: 64 kiB, 1-cycle latency; memory 13-cycle latency.
Benchmarks: SPEC92 integer and fp, and other integer.
Multiscalar Results
2-way DS SS v. 8x2-way Multiscalar.
For SPEC92 integer benchmarks, speedup 1.43.
Small utilities and synthetic benchmark have higher speedup.
(/ (+ 1.33 2.64 1.24 0.95 1.5 0.9) 6) = 1.4267
About 85% task prediction accuracy.
Comments:
Multiscalar system has eight times the decode issue width (and cost).
Hard to draw conclusions without simulation of 8- and 16-way ss systems.
Cache miss latency is too low; with realistic latency, speedups may have been lower.
Multiscalar Shortcomings
Requires evenly distributed critical path...
execution slows at critical path clumps.
This is because resources are partitioned into PUs.
IMT - Implicit Multithreading
Start with code compiled for multiscalar.
Run on something like an SMT (simultaneous multithreaded processor).
Key Differences Between IMT and Multiscalar
Functional units are shared between tasks.
Fetch based on predicted dependence ...
and resource usage.
Non-Speculative Task
The oldest in-flight task is non-speculative.
DRP - Dynamic Resource Predictor
Predicts ROB slots and physical regs needed.
Used to determine whether to fetch a speculative task.
ITDH - Inter-Thread (task) Dependence Predictor
Three consecutive identical tasks:
Assumed to be loop iterations.
Are predicted to be independent.
Anything else dependent.
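The ITDH rule above can be sketched as (assumed interface; the real predictor works on task IDs from the sequencer):

```python
class ITDH:
    # Predict tasks independent iff the last three task IDs are
    # identical (assumed to be iterations of the same loop).
    def __init__(self):
        self.hist = []
    def observe(self, task_id):
        self.hist = (self.hist + [task_id])[-3:]   # keep last three
    def predict_independent(self):
        return len(self.hist) == 3 and len(set(self.hist)) == 1
```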
IMT Fetch
If tasks dependent or if not enough resources ...
fetch only non-speculative (oldest) task.
Otherwise fetch task with fewest in-flight instructions.
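The fetch policy in three lines (toy model: `resources_ok` stands in for the DRP check, `independent` for the ITDH prediction):

```python
def choose_fetch_task(tasks, resources_ok, independent):
    # tasks: list of (task_id, n_in_flight_insns); tasks[0] is oldest.
    if not independent or not resources_ok:
        return tasks[0][0]                        # non-speculative only
    return min(tasks, key=lambda t: t[1])[0]      # fewest in-flight insns
```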
IMT Simulations
All: Eight-way dynamically scheduled.
Base: 1024-entry ROB.
All: Two predictions per cycle.
IMT: 64-entry DRP, 3-entry ITDH
Benchmarks: SPEC2000 integer and FP.
IMT Results
Integer: 20% speedup, FP 30% speedup.
--------------------------------------------------------------------------------
Conclusions
Potential
Large, based on limit study.
Rotenberg WR-FD
Details on finding reconvergence not worked out.
Skipper
Small speedup, moderate-to-high hardware complexity.
Probably won't be implemented. (2003)
IMT
Good speedup.
Moderate to high hardware complexity.
Requires special compilation or post-processing.
--------------------------------------------------------------------------------
Study Goals
Control Independence Concept
Know what it is and why one might want to exploit it.
Understand how critical path determines how well it can be exploited.
Understand how dependencies complicate things.
For Each Technique
Show a favorable or unfavorable code example ...
and how they execute.
Describe the special tables used and for each answer:
How is it indexed? (PC of decoding instruction, etc.)
What information is in each entry and what is the info used for?
When is the table read?
When is the table updated?
Show other special hardware used.
Show how the hardware exploits control independence.
Explain the technique's shortcomings.