LSU EE 7700-2 -- Fall 2003 -- Multiprocessors, Etc.

Note Set 6 -- Control Independence Exploitation

Under Construction, To Do
  Update under construction list.

Contents
  Introduction
  Rotenberg WR-FD
  Skipper
  Multiscalar/IMT
  Trace Processor (Not covered)
  Conclusions
  Study Goals

References
  Listed in each section.
  Linked to http://www.ece.lsu.edu/tca/ref.html

Glossary
  DS: Dynamically Scheduled. (a.k.a., out-of-order)
  JTB: (a.k.a. BTB)  A PC-indexed table listing predicted targets for indirect jumps.
  LSQ: Load/Store Queue.
  RAS: Return address stack. Used to predict returns, assumes standard call/return insn usage.
  ROB: Reorder Buffer (a.k.a., active list)
  Register Map: Table listing physical register assigned to each architected register.
  Store/Load Disambiguation: Determining whether a load and an earlier store access the same address.
  SS: Superscalar.

--------------------------------------------------------------------------------
Introduction

 Under Construction (20% done)

 Reference
  Rotenberg, HPCA 99

 Status
  Experimental, implementations too complex and costly.

 Exploitation Features (Rotenberg Terminology)
  Out of order fetch (WR) or selective squash (nWR).
  Speculate (false) dependencies (nFD) or wait for branch resolution (FD).

 Other Terminology
  S-Thread: Thread running code that was skipped (so it is run later than usual).

 Potential Performance Improvement
  Depends on: fetch rate and the location of the critical path.
  Cases
   Execution following branch is fetch-limited.  (nWR better.)
   Critical path at branch target or fall-through. (WR better.)
  Example (Blackboard)
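 A minimal sketch of the blackboard example (hypothetical code): the branch is hard to predict, but work after the reconvergence point executes either way.

```python
# Hypothetical illustration of control independence.  The hard-to-predict
# branch selects which arm runs, but everything from the reconvergence
# point onward executes regardless of the branch outcome.
def f(x, data):
    if x % 2 == 0:          # hard-to-predict branch (trigger)
        y = data + 10       # taken arm: control dependent
    else:
        y = data - 10       # fall-through arm: control dependent
    # Reconvergence point: below here is control independent.
    z = data * 2            # control AND data independent of the branch
    return y + z            # control independent, data dependent on y
```

 On a misprediction a conventional processor squashes the computation of z even though it is unaffected; CI exploitation tries to preserve it.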

 Potential Speedup From Limit-Study Simulation
  About five times.  Wow!  Maybe.
  Assumptions
   Perfect fetch (16 insn/cycle, any number of branches).
   All loads have 1-cycle access.
   CTIs: 2^16 entry gshare, 2^16 entry JTB, RAS.
   Branch misprediction penalty of about 4-5 cycles.
   Unlimited execution resources.
   Perfect store/load address disambiguation.
   Perfect knowledge of reconvergence point and skipped code.

 Limit Study Conclusions
  (See Figure 3 graphs.)
  Oracle v. nWR-nFD:  gcc, for 2ki ROB about 2 IPC difference (large).
   nWR-nFD not as good as oracle because:
    Critical path passes through skip thread.
    Reconvergence points cannot be found for some branches.
  nWR-nFD v. WR-nFD:   gcc, for 2ki ROB, almost 4 IPC difference (large)
   WR not as good as nWR because:
    Guess: Long skip thread and long resolution time.
  
 Register Re-Mapping Headaches
  T-Map: ID Register Map just after Trigger.
  R-Map: ID Register Map just before Reconvergence.
  nWR Headache: Have to predict R-Map.
  WR Headache: Have to predict R-Map or re-rename.
  These headaches make CI exploitation impractical (so far).
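 A sketch of the headache (register names and map representation assumed): to use the reconvergence point early, hardware must predict the R-Map, i.e., guess which architected registers the skipped code will write and pre-assign free physical registers for them.

```python
# Sketch (assumed representation): a register map takes each architected
# register to a physical register.
t_map = {'r1': 'p7', 'r2': 'p3'}    # T-Map: map just after the trigger

# The skipped region writes r1, consuming a fresh physical register.
r_map = dict(t_map)
r_map['r1'] = 'p9'                  # R-Map: map just before reconvergence

# To rename past the branch early, predict which registers the skipped
# code writes and pre-assign free physical registers to them.  (Here the
# prediction and the free list happen to match the actual R-Map.)
predicted_writes = {'r1'}           # prediction, e.g., from a history table
free_regs = ['p9', 'p12']
predicted_r_map = dict(t_map)
for r in sorted(predicted_writes):
    predicted_r_map[r] = free_regs.pop(0)   # pre-assigned physical register
```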

 Control Independence Exploitation Techniques 
  Described in detail below.
   Rotenberg WR-FD
   Skipper
   Multiscalar/IMT
   Trace Processor (Not covered)

--------------------------------------------------------------------------------
Rotenberg WR-FD

 Rotenberg, Jacobson, Smith;  HPCA 1999
  Published with control independence limit study.

 Key Details
  Fetch is along predicted path (WR).
  On misprediction, squash only up to the reconvergence point (if it was reached).
  Ignore unknown dependencies.
  Reconvergence point detected by compiler or pre-processing software.

 Recovery
  Re-map source registers, re-issue if necessary.
  Re-issue loads.
  Re-mapping occurs at same rate as original mapping.

 Reconvergence
  On misprediction scan reorder buffer for reconvergence point.
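 The scan can be sketched as follows (ROB modeled as a list, oldest first; representation assumed):

```python
def find_reconvergence(rob, branch_idx, reconv_pc):
    """Scan ROB entries younger than the mispredicted branch for the
    first instruction fetched from the reconvergence PC.  Returns its
    index, or None if reconvergence was never reached (full squash)."""
    for i in range(branch_idx + 1, len(rob)):
        if rob[i]['pc'] == reconv_pc:
            return i
    return None
```

 Entries between the branch and the returned index are squashed; entries from the index onward are control independent and kept (subject to data-dependence repair).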

 Simulation
  16-way superscalar DS, ROB 128 to 512 entries.
  Front End: Ideal fetch,  64 ki entry gshare, JTB, RAS.
  Data Cache: L1 64 kiB, 4-way, 2 cycle; L2: perfect, 14 cycle hit.
  Store/Load Disambiguation: Execute load, re-execute on error.
  Systems
   Base (Ordinary system). 
   CI
   CI-I:  Zero-cycle recovery. 
  Benchmarks: selected SPEC95 (mostly integer).

 Simulation Results
  Improvement from CI about 10%.
  CI-I is not much faster.
  Over 50% of mispredictions reconverge.
  About 50 instructions after reconvergence point (at resolve?).

 Simulation Limitations

  Impact of Branch Resolution Time
   Important because that determines how many CI instructions are available.
   Overstated:
    Fetch unrealistically fast: 16-way and perfect instruction supply.
   Understated:
    Pipeline depth short compared to current processors.
   The two effects may cancel each other.

  Impact of L2 Cache Misses
   Not modeled since cache is perfect.


--------------------------------------------------------------------------------
Skipper

 Cher and Vijaykumar, MICRO 2001

 Skipper
  CI exploitation technique.

 Key Details
  Completely implemented in hardware. 
  At some branches fetch skips to reconvergence point.
  Instructions with unknown dependencies not executed.
  Use confidence estimator to choose branch.

 Reconvergence Point Detection
  Troublesome branches found by confidence estimator.
  Reconvergence point guessed based on branch direction:
   Forward: target of jump just before branch target.
   Backward: fall-through instruction.
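 The guess can be sketched as below (fixed 4-byte instructions and the insn_at lookup are assumptions):

```python
def guess_reconvergence(branch_pc, branch_target, insn_at):
    """Skipper-style guess (sketch).  Forward branch: assume an if-else
    layout, so the unconditional jump just before the branch target ends
    the 'then' arm and jumps to the join point.  Backward branch (loop):
    use the branch's own fall-through instruction."""
    if branch_target > branch_pc:           # forward branch
        jump = insn_at(branch_target - 4)   # insn just before the target
        return jump['target']               # where the 'then' arm jumps to
    else:                                   # backward branch
        return branch_pc + 4                # fall-through instruction
```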

 Definitions
  S-Thread: (class) thread running skipped instructions.
 
 SCIT: Skipped Computation Information Table
  Table storing information about skippable instructions.
  Indexed By: PC of troublesome branch.
  Fields:
   Reconvergence PC.
   Maximum number of skipped instructions.
   S-thread input registers. (Main thread registers read by s-thread.)
   S-thread output registers. (s-thread registers read by main thread.)
  All information in SCIT is approximate ... 
   because different paths may be taken when the entry is used.
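 A sketch of an SCIT entry and its update (field names taken from the description above; the update policy of unioning register sets and keeping the maximum length is an assumption):

```python
from dataclasses import dataclass, field

@dataclass
class SCITEntry:
    reconv_pc: int                              # predicted reconvergence PC
    max_skipped: int                            # max skipped insns observed
    in_regs: set = field(default_factory=set)   # main-thread regs read by s-thread
    out_regs: set = field(default_factory=set)  # s-thread regs read by main thread

scit = {}                                       # indexed by PC of troublesome branch

def scit_update(branch_pc, reconv_pc, skipped, reads, writes_read_later):
    e = scit.setdefault(branch_pc, SCITEntry(reconv_pc, 0))
    e.reconv_pc = reconv_pc
    e.max_skipped = max(e.max_skipped, skipped)  # approximate: union over paths
    e.in_regs |= reads
    e.out_regs |= writes_read_later
```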

 Skipping (Overview)
  Branch Encountered, Maybe Skip
   Confidence estimator assigns low confidence to branch prediction.
   Retrieve entry from SCIT using PC of branch.
   Copy current register map for S-thread.
   Update main-thread register map ...
    using predicted S-thread output registers ...
    and pre-assigned physical registers.
   Create gap in ROB and LSQ for S-thread.
   In main thread jump to reconvergence point ...
    using updated register map.
   Continue main thread until branch resolves.
   Main thread uses ROB and LSQ entries after the respective gaps.
  Branch Resolves:
   Start S-thread, use ROB and LSQ entries in the respective gaps.
   Execute S-thread until reconvergence PC reached.
  Reconvergence PC Reached.
   Inject pmove instructions: pmove PD, RS
    RS: an architected register, renamed using S-thread map (at end of S-thread).
    PD: the physical register preassigned to RS before S-thread started.
   Switch back to main thread, skipping complete.
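 The register-map handling above can be sketched as a straight-line script (register names, map representation, and the preassignment are assumptions; only one output register shown):

```python
# Minimal sketch of Skipper's map handling around a skipped region.
regfile = {}                                 # physical register file
main_map = {'r1': 'p1', 'r2': 'p2'}          # main-thread map at the branch

s_map = dict(main_map)                       # s-thread starts from a copy
predicted_out = ['r1']                       # predicted s-thread output regs
preassigned = {'r1': 'p8'}                   # preassigned physical registers

# Main thread continues at the reconvergence point, reading r1 from p8.
for r, p in preassigned.items():
    main_map[r] = p

# ... branch resolves, s-thread runs the skipped code and writes r1 ...
s_map['r1'] = 'p5'
regfile['p5'] = 42

# Reconvergence PC reached: inject "pmove p8, r1" -- copy the value of
# r1 (renamed with the s-thread map) into the preassigned register.
for r in predicted_out:
    regfile[preassigned[r]] = regfile[s_map[r]]
```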

 Simulated Configuration
  Eight-way superscalar DS; 128-, 256-entry ROB; 43-entry LSQ.
  Front End: Predict and fetch two branches per cycle (256-ROB version).
  Nine-cycle mispredict penalty.
  Cache: L1 64 kiB, 2-cycle; L2 2-MiB 12-cycle, Memory 100-cycle.
  SCIT: 128 entries.
  Maximum gap: 64 insn.
  Confidence Estimator: JRS, 4 bits.
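 A sketch of a JRS-style estimator (table size and threshold are assumptions; real JRS indexes with PC and global history, here plain PC for brevity):

```python
# JRS confidence estimator sketch: a table of resetting saturating
# counters.  A correct prediction increments the counter; a
# misprediction resets it to zero.  A branch's prediction is high
# confidence only when its counter is at or above the threshold.
class JRS:
    def __init__(self, entries=128, bits=4, threshold=15):
        self.ctr = [0] * entries
        self.max = (1 << bits) - 1
        self.threshold = threshold
        self.entries = entries

    def confident(self, pc):
        return self.ctr[pc % self.entries] >= self.threshold

    def update(self, pc, correct):
        i = pc % self.entries
        self.ctr[i] = min(self.ctr[i] + 1, self.max) if correct else 0
```

 Skipper skips at branches the estimator marks low confidence.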

 Simulation Results
  Roughly 8% speedup.
  Predictions of reconvergence point accurate.
  About 25% of mispredicted branches skipped -- on the low side.
   Reasons given:  gap length (reconvergence too far away).
  

-------------------------------------------------------------------------------- 
Multiscalar/IMT

 Multiscalar: Sohi, Breach, and Vijaykumar, ISCA 1995
 IMT: Park, Falsafi, and Vijaykumar, ISCA 2003

 Multiscalar Goals
  Faster clock frequency by dividing processor into small clusters ...
   with each cluster executing a contiguous portion of program (task).
  Improved performance by exploiting control independence.

 IMT Goals
  Exploit control independence.
  Unlike Multiscalar, uses one large processor core.

 Common Ideas
  Compiler divides program into tasks.
  Hardware executes several tasks in parallel ...
   Multiscalar: one task per processing unit.
   IMT: one task per thread.
  Fetch and decode switches between tasks.

 Tasks
  Small parts of program, can be several basic blocks.
  Each task has a single entry point but can have multiple exits.
  Identified by compiler.
  Task Descriptor determined for each task containing:
   Create Mask: List of live-out (later-used) registers.
   Use Mask: List of registers read by task that were written by instructions before the task.
   Information about task exits. (Used to predict next task.)
  Instructions have the following extra information:
   Stop "Bit": Indicator that instruction is last in task.
   Forward Bit: Indicator that dest. register is a live out.
  Added Instruction
   Release: Sends register to next task.
   Used when forward bit wouldn't work.
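 The two masks can be sketched as a single pass over a task (instruction representation assumed; the create mask here conservatively takes every written register as potentially live out, whereas the compiler restricts it to registers actually used later):

```python
def task_masks(insns):
    """insns: list of (dest_regs, src_regs) pairs in program order.
    Returns (create_mask, use_mask):
      create: registers written by the task (potentially live out),
      use:    registers read before any write inside the task,
              i.e., values produced by instructions before the task."""
    created, used = set(), set()
    for dests, srcs in insns:
        used |= set(srcs) - created   # reads of values from before the task
        created |= set(dests)
    return created, used
```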

 Task Creation (Compiler Pass)
  Done after compiler produces ordinary assembler code.
  Decide on maximum task size (instructions).
  Decide on maximum task exits.
  Start with Control-Flow Graph (CFG)
   Nodes are basic blocks.
   Arcs are control-flow paths between blocks.
  Combine nodes so that:
   Maximum number of insn and exits not exceeded.
   Favor blocks with fewer external dependencies.
   Favor blocks containing reconvergence points.

 Multiscalar Processing Unit (PU)
  Multiscalar processor has several PUs.
  Small processor, has own register file.
  One task per PU.
  Live out register values sent between PUs ...
   there is no shared register file.
  Special structure, ARB, for coordinating loads and stores.

 Multiscalar Sequencer
  Tasks predicted using local history predictor.
  Other predictors used in later Multiscalar papers.
 
 Multiscalar Simulations
  Comparison
    1- and 2-way static and dynamically scheduled superscalar.
    4 (4x1)-, 8 (8x1,4x2)- and 16 (8x2)-way Multiscalar.
  Cache: 64 kiB, 1-cycle latency; memory 13-cycle latency.
  Benchmarks: SPEC92 integer and fp, and other integer.

 Multiscalar Results
  2-way DS SS v. 8x2-way Multiscalar.
   For integer SPEC 92 benchmarks, speedup 1.43
   Small utilities and synthetic benchmark have higher speedup.
    Average of per-benchmark speedups: (1.33 + 2.64 + 1.24 + 0.95 + 1.50 + 0.90) / 6 ≈ 1.43
   About 85% task prediction accuracy.
  Comments:
   Multiscalar system has eight times the decode/issue width (and cost).
    Hard to draw conclusions without simulation of 8- and 16-way ss systems.
   Cache miss latency is unrealistically low; with realistic latencies the speedups might be lower.

 Multiscalar Shortcomings
  Requires evenly distributed critical path...
   execution slows at critical path clumps.
  This is because resources are partitioned into PUs.
  
 IMT - Implicit Multithreading
  Start with code compiled for multiscalar.
  Run on something like an SMT (simultaneous multithreaded processor).
 
 Key Differences Between IMT and Multiscalar
  Functional units are shared between tasks.
  Fetch based on predicted dependence ...
   and resource usage.

 Non-Speculative Task
  The oldest in-flight task is non-speculative.

 DRP - Dynamic Resource Predictor
  Predicts ROB slots and physical regs needed.
  Used to determine whether to fetch a speculative task.

 ITDH - Inter-Thread (Task) Dependence Heuristic
  Three consecutive identical tasks:
   Assumed to be loop iterations.
   Are predicted to be independent.
   Anything else is predicted dependent.
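 The heuristic can be sketched as a three-deep history of task start PCs (representation assumed):

```python
# ITDH sketch: if the last three fetched tasks have the same start PC,
# assume they are loop iterations and predict the next task independent;
# otherwise predict dependent.
class ITDH:
    def __init__(self):
        self.history = []                 # start PCs of recent tasks

    def record(self, task_pc):
        self.history = (self.history + [task_pc])[-3:]

    def independent(self):
        return len(self.history) == 3 and len(set(self.history)) == 1
```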

 IMT Fetch
  If tasks dependent or if not enough resources ...
   fetch only non-speculative (oldest) task.
  Otherwise fetch task with fewest in-flight instructions.
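 The fetch policy above, as a sketch (task representation and the DRP/ITDH inputs reduced to booleans, an assumption):

```python
def pick_fetch_task(tasks, independent, resources_ok):
    """tasks: in-flight tasks, oldest first, each with an 'inflight'
    instruction count.  If tasks are predicted dependent (ITDH) or the
    DRP says resources are short, fetch only the non-speculative
    (oldest) task; otherwise fetch the task with the fewest in-flight
    instructions."""
    if not independent or not resources_ok:
        return tasks[0]
    return min(tasks, key=lambda t: t['inflight'])
```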

 IMT Simulations
  All: Eight-way dynamically scheduled.
  Base: 1024-entry ROB.
  All: Two predictions per cycle.
  IMT: 64-entry DRP, 3-entry ITDH
  Benchmarks: SPEC2000 integer and FP.
  
 IMT Results
  Integer: 20% speedup, FP 30% speedup.
  

-------------------------------------------------------------------------------- 
Conclusions

 Potential
  Large, based on limit study.

  Rotenberg WR-FD
   Details on finding reconvergence not worked out.

  Skipper
   Small speedup, moderate-to-high hardware complexity.
   Probably won't be implemented. (2003)

  IMT
   Good speedup.
   Moderate to high hardware complexity.
   Requires special compilation or post-processing.

 
   
--------------------------------------------------------------------------------
Study Goals

 Control Independence Concept
  Know what it is and why one might want to exploit it.
  Understand how critical path determines how well it can be exploited.
  Understand how dependencies complicate things.

 For Each Technique
  Show a favorable or unfavorable code example ...
   and how it executes.
  Describe the special tables used and for each answer:
   How is it indexed? (PC of decoding instruction, etc.)
   What information is in each entry and what is the info used for?
   When is the table read?
   When is the table updated?
  Show other special hardware used.
  Show how the hardware exploits control independence.
  Explain the technique's shortcomings.