References for material covered in lectures and needed to complete the assignments. This page will be updated as the semester progresses. Next to each lecture set is a year. If the year is in the past (before 2003), that material may still be revised.
When available, links are provided to full-text versions of the material. If a dialog pops up asking for a username and password use "ee4720" as the username and use the password given in class. This does not apply to the ACM digital library material. Some of the material is copyrighted and requires a subscription (e.g., ACM Digital Library) or one-time payment for access.
Critical path definition suitable for dynamically scheduled systems. Also describes a critical path predictor and shows its effectiveness for data prediction and instruction steering.
Slack definition, prediction, and application.
A framework for analyzing the interaction between events contributing to the critical path. The analysis shows interaction between classes of events, such as cache misses and branch mispredictions, but readers of the previous two papers may be disappointed to find that no specific sets of events are predicted that could guide an always-costs-sometimes-benefits feature such as pre-execution. Nevertheless, still interesting.
Recommended. Presents the performance of an ideal processor, limited only by data dependencies, and the performance of realistic systems. The ideal processor's performance is the curve labeled "perfect" in Figure 18. (The perfect processor in other figures is less perfect, for example, having a smaller window.) The paper also has a good summary of important aspects of dynamically scheduled systems.
Recommended. A good paper to review basic techniques. In addition, describes a hybrid predictor and also the gshare modification to Yeh & Patt's GAp predictor. Simulation data for several two-level predictors using tagless direct-mapped tables.
Scott McFarling, “Combining branch predictors,” Digital Equipment Corporation WRL Technical Note TN-36, June 1993. (118 kB PDF)
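The gshare scheme mentioned above can be illustrated with a minimal sketch (my own simplification, not code from the tech note; the table size, counter initialization, and PC shift amount are assumptions): the global history register is XORed with branch-PC bits to index a single table of 2-bit saturating counters.

```python
class Gshare:
    """Minimal gshare branch predictor sketch."""
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                         # global (outcome) history register
        self.pht = [1] * (1 << index_bits)   # 2-bit counters, start weakly not-taken

    def _index(self, pc):
        # XOR hashes history into the PC index, spreading out branches
        # that would otherwise alias on low-order PC bits alone.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2    # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

After a warmup period on an always-taken branch, the history register saturates, the same counter is reused, and the predictor settles on "taken."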
Analysis of how different branch behavior (whether mostly one-way) affects predictor performance and hybrid predictor designs based on the analysis.
Hybrid predictor. Unlike McFarling's, uses more than two predictors and focuses on the warmup problems introduced by context switching not examined in other papers.
Dynamically adjust the size of the GHR.
The bi-mode predictor, a gshare-like predictor with separate tables for branches biased (mostly) taken and branches biased not taken.
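The bi-mode organization described above can be sketched as follows (an illustrative simplification, not the paper's design; table sizes, counter initial values, and the exact choice-update rule are my assumptions): a choice table indexed by PC selects one of two gshare-style direction tables, so oppositely biased branches that alias land in different tables.

```python
class BiMode:
    """Minimal bi-mode predictor sketch."""
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.ghr = 0
        self.choice = [1] * (1 << bits)      # per-PC; >= 2 selects the taken-side table
        self.pht = [[1] * (1 << bits),       # direction table 0 (not-taken side)
                    [2] * (1 << bits)]       # direction table 1 (taken side)

    def _didx(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask   # gshare-style index

    def predict(self, pc):
        sel = 1 if self.choice[(pc >> 2) & self.mask] >= 2 else 0
        return self.pht[sel][self._didx(pc)] >= 2

    def update(self, pc, taken):
        ci = (pc >> 2) & self.mask
        sel = 1 if self.choice[ci] >= 2 else 0
        di = self._didx(pc)
        correct = (self.pht[sel][di] >= 2) == taken
        # Update only the selected direction counter.
        if taken:
            self.pht[sel][di] = min(3, self.pht[sel][di] + 1)
        else:
            self.pht[sel][di] = max(0, self.pht[sel][di] - 1)
        # Update the choice counter toward the outcome, except when the
        # choice disagreed with the outcome but the prediction was correct.
        choice_agrees = (self.choice[ci] >= 2) == taken
        if choice_agrees or not correct:
            if taken:
                self.choice[ci] = min(3, self.choice[ci] + 1)
            else:
                self.choice[ci] = max(0, self.choice[ci] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```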
YAGS Predictor. Like bi-mode, except PHTs also store tags and are only used if bimodal mispredicts.
The agree predictor: reduce destructive aliasing by predicting whether a branch's direction agrees with a stored per-branch bias, rather than predicting the direction itself, so that aliasing branches tend to update the shared counter in the same direction.
A gshare-like predictor with a dynamically adjusted outcome history size.
Two-level branch predictors, including what is called the global and local predictors in class. A widely cited paper, many later designs have been based on these. Simulations are performed for set-associative tables (tagless direct-mapped tables used elsewhere).
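The local (per-branch history) predictor discussed above can be sketched as follows (an illustrative simplification with assumed table sizes and initial values, not the paper's exact configuration): a first-level table holds a history register per branch, and that history indexes a second-level table of 2-bit counters, letting the predictor learn per-branch patterns such as strict alternation.

```python
class LocalPredictor:
    """Minimal two-level local (per-branch history) predictor sketch."""
    def __init__(self, hist_bits=4, entries=1024):
        self.hmask = (1 << hist_bits) - 1
        self.entries = entries
        self.bht = [0] * entries              # level 1: per-branch history registers
        self.pht = [1] * (1 << hist_bits)     # level 2: shared 2-bit counters

    def _bidx(self, pc):
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.pht[self.bht[self._bidx(pc)]] >= 2   # True = predict taken

    def update(self, pc, taken):
        b = self._bidx(pc)
        h = self.bht[b]
        if taken:
            self.pht[h] = min(3, self.pht[h] + 1)
        else:
            self.pht[h] = max(0, self.pht[h] - 1)
        self.bht[b] = ((h << 1) | int(taken)) & self.hmask
```

With a strictly alternating branch, the two recurring history values train two different counters, so after warmup the predictor follows the alternation perfectly, which a single counter could not do.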
History consists of target addresses rather than branch directions. An interesting idea, but not much better than existing schemes.
An early multiple branch predictor paper.
A front end using a superblock predictor. It can predict more than one basic block per cycle if those blocks contain highly biased not-taken branches. The contribution of this paper is the two-level design (not unlike a two-level cache) of the predictor and the queuing of predictions, not the predictor itself.
Description of the trace cache.
Description of a next-trace predictor. Predicts a trace ID, in effect predicting an instruction address and several following branch outcomes. Next-trace predictor uses several different techniques to get very high accuracy. Cost of predictor very high.
A survey of data prefetch techniques.
A complete implementation of hardware pre-execution (called pre-computation in the paper). Loads that spend time at the ROB head are pre-execution candidates. P-thread constructor looks for loop iterations (as opposed to a linear stream of instructions without regard to whether a static instruction appears multiple times).
A careful look at how pre-execution thread construction and selection methods affect performance. Selection is based on the latency hiding potential and overhead of threads individually and in groups. Study also examines impact of p-thread optimization. Emphasis is on studying selection techniques, not on realizable ways of doing so.
An implementation of hardware pre-execution which requires a profiling step to construct p-threads (unlike Collins which constructs p-threads in hardware). Both loads and branches are pre-executed and results from instructions in the p-thread are re-integrated into the main thread (unlike Collins).
A study providing evidence that pre-execution can work. The performance-degrading instructions are loads that frequently miss and branches that are frequently mispredicted. A backward slice is the set of instructions (except for CTI) needed to execute a performance-degrading instruction. The study shows that the backward slice is small (or sparse) enough to make pre-execution possible.
Pre-execution in which p-threads are launched by compiler-inserted instructions. Uses Itanium instructions.
Control independence study. Also available as a more complete technical report.
At a difficult-to-predict branch, start fetching at the reconvergence point, provided the hardware remembers the branch and its reconvergence point. Hardware-only implementation.
Multiscalar: A compiler chops up what would be a single thread of control into many small tasks. With the program in this form, multiscalar hardware can easily exploit control independence (and can perhaps more easily overlap instruction execution).
A big refinement of multiscalar in which scheduling of the tasks is based on predicted dependencies.
An implementation of eager execution in which execution proceeds down both directions of branches that a confidence estimator deems hard to predict.
Description of trace processor using several aggressive techniques, including value prediction. Includes comparison with a higher cost system: a superscalar processor with similar prediction capabilities and issue bandwidth.
Description of simultaneous multithreading (SMT), something like what Intel calls hyperthreading. Analysis of how variations in fetch and thread scheduling mechanisms affect performance. Simulations use multiprocessing workload (each thread from a different program).
Primary. Comparison of SMT with a single-chip multiprocessor having roughly equivalent total resources. Looks at resource use inefficiency in multiprocessor and impact of SMT on memory system, and other effects. Simulations use parallel workload (threads all part of one program).
A description of rePlay. First find frequently taken paths through the program, then produce an optimized version of that path and cache it for next time.
Optimize trace before storing it in trace cache. Optimization done on smaller sections of code than rePlay (above).
Divide an ordinary program into ≈ 128-instruction tasks. Using extreme optimizations (foolish anywhere else), distill from the original program a second program that can predict register values needed by the tasks. Run the distilled program on one core and the tasks on the remaining cores. Tasks verify inputs provided by the distilled code and may produce other architectural state not needed by the next few succeeding tasks.
Using value prediction to compute faster than the dataflow limit. Paper looks at prediction accuracy, but does not examine speedup.
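The value-prediction idea above can be illustrated with a minimal last-value predictor (an assumed simplification for illustration; real designs add stride prediction and larger confidence mechanisms): predict that an instruction will produce the same value it produced last time, and predict at all only once a small confidence counter saturates.

```python
class LastValuePredictor:
    """Minimal last-value predictor sketch with a 2-bit confidence counter."""
    def __init__(self):
        self.table = {}   # pc -> (last value, confidence 0..3)

    def predict(self, pc):
        v, conf = self.table.get(pc, (None, 0))
        return v if conf >= 2 else None   # None = no prediction made

    def update(self, pc, value):
        v, conf = self.table.get(pc, (None, 0))
        # Confidence grows while the value repeats, resets on a change.
        conf = min(3, conf + 1) if v == value else 0
        self.table[pc] = (value, conf)
```

A correct value prediction lets a dependent instruction execute before its producer, which is how value prediction can beat the dataflow limit; a misprediction requires a repair mechanism much like a branch misprediction.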
An examination of re-use.
Paper presents a trace re-use technique and presents a performance comparison of several forms of computation re-use: at the instruction, block, and trace level. Comparison is on a system with an unrealistic cache, especially when compared to the re-use tables. But, it's a comparison.
Required. Description of the R10000 implementation of the MIPS architecture.
Description of the 21264 implementation of the Alpha architecture.
Recommended. User manual for Shade, a simulation tool needed for some homework assignments.
Introduction to Shade, Sun Microsystems, Jun 25, 1997. (88.6 kB PDF)
Reference. Manpages for shade.
Shade User's Manual, V5.33A, Sun Microsystems, 1998. (85.4 kB PDF)
Recommended. Information on Shade and overview of instrumentation and simulation techniques.
Recommended. Description of instrumentation by random sampling using event counters.
Recommended. Description of SimOS, a processor simulator, and some sample applications.
Secondary. An annotated listing of simulation and instrumentation systems. A useful place to find further information or to get a feel for the range of systems.
Recommended. Evaluation of several branch prediction confidence estimators.
Recommended. Evaluation of best-case performance of several data prediction schemes.
Secondary. Study of performance limits of superscalar processors, used for the Hennessy & Patterson computer architecture text, section 4.7.
David W. Wall, “Limits of Instruction-Level Parallelism,” Technical Report, Digital WRL-93/6, November 1993. (464 kB PDF)
Recommended. Predicting load/store dependencies using store sets. Includes performance of systems that predict all pairs dependent and no pairs dependent.
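The store-set mechanism described above can be sketched roughly as follows (a simplification of the scheme; the table organization, ID allocation, and merge handling here are my assumptions): loads and stores that have conflicted in the past are assigned a common store set ID, and a load is made to wait for the most recently fetched, not-yet-issued store in its set.

```python
class StoreSets:
    """Minimal store-set memory dependence predictor sketch."""
    def __init__(self, entries=256):
        self.entries = entries
        self.ssit = [None] * entries   # store set ID table: PC index -> SSID
        self.next_ssid = 0
        self.lfst = {}                 # last fetched store table: SSID -> inst number

    def _idx(self, pc):
        return (pc >> 2) % self.entries

    def violation(self, load_pc, store_pc):
        # A load executed before an older store to the same address:
        # place both instructions in the same store set.
        ssid = self.ssit[self._idx(store_pc)]
        if ssid is None:
            ssid = self.next_ssid
            self.next_ssid += 1
        self.ssit[self._idx(load_pc)] = ssid
        self.ssit[self._idx(store_pc)] = ssid

    def fetch_store(self, store_pc, inum):
        ssid = self.ssit[self._idx(store_pc)]
        if ssid is not None:
            self.lfst[ssid] = inum

    def load_must_wait_for(self, load_pc):
        # Instruction number the load should wait on, or None if it may issue.
        ssid = self.ssit[self._idx(load_pc)]
        return None if ssid is None else self.lfst.get(ssid)

    def store_issued(self, store_pc, inum):
        ssid = self.ssit[self._idx(store_pc)]
        if ssid is not None and self.lfst.get(ssid) == inum:
            del self.lfst[ssid]
```

Until a violation is observed, loads issue freely (the "no pairs dependent" behavior); after one, only loads in the offending store's set are delayed, avoiding the cost of treating all pairs as dependent.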
Secondary. Forwarding data from store to load if dependency predicted.
Primary. Description of simultaneous multithreading (SMT). Analysis of how variations in fetch and thread scheduling mechanisms affect performance. Simulations use multiprocessing workload (each thread from a different program).
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm, “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” in Proceedings of the International Symposium on Computer Architecture, May 1996, pp. 191-202.
Primary. Comparison of SMT with a single-chip multiprocessor having roughly equivalent total resources. Looks at resource use inefficiency in multiprocessor and impact of SMT on memory system, and other effects. Simulations use parallel workload (threads all part of one program).
Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen, “Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading,” ACM Transactions on Computer Systems, vol. 15, no. 3, pp. 322-354, August 1997.
Primary. Description of the Tera multithreaded processor.
Secondary. Short Tera program example and description of system software.
Secondary. Company web site. Some additional technical information and information about the company.
Tera (html)
David M. Koppelman - koppel@ece.lsu.edu | Modified 24 Apr 2007 15:25 (20:25 UTC) |