References for material covered in lectures and needed to complete the assignments. This page will be updated as the semester progresses. Next to each lecture set is a year. If the year is in the past (before 2003), that material may still be revised.
When available, links are provided to full-text versions of the material. If a dialog pops up asking for a username and password use "ee4720" as the username and use the password given in class. This does not apply to the ACM digital library material. Some of the material is copyrighted and requires a subscription (e.g., ACM Digital Library) or one-time payment for access.
Critical path definition suitable for dynamically scheduled systems. Also describes a critical path predictor and shows its effectiveness for data prediction and instruction steering.
Slack definition, prediction, and application.
A framework for analyzing the interaction between events contributing to the critical path. The analysis shows interaction between classes of events, such as cache misses and branch mispredictions, but readers of the previous two papers may be disappointed to find that no specific sets of events are predicted that could guide an always-costs-sometimes-benefits feature such as pre-execution. Nevertheless, still interesting.
Recommended. Presents the performance of an ideal processor, limited only by data dependencies, and the performance of realistic systems. The ideal processor's performance is the curve labeled "perfect" in Figure 18. (The perfect processor in other figures is less perfect, for example, having a smaller window.) The paper also has a good summary of important aspects of dynamically scheduled systems.
Recommended. A good paper to review basic techniques. In addition, describes a hybrid predictor and also the gshare modification to Yeh & Patt's GAp predictor. Simulation data for several two-level predictors using tagless direct-mapped tables.
Scott McFarling, “Combining branch predictors,” Digital Equipment Corporation WRL Technical Note TN-36, June 1993. (118 kB PDF)
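The gshare scheme mentioned above can be illustrated with a minimal sketch (my own simplification, not code from the tech note; the table size, counter initialization, and PC shift amount are assumptions): the global history register is XORed with branch-PC bits to index a single table of 2-bit saturating counters.

```python
class Gshare:
    """Minimal gshare branch predictor sketch."""
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                         # global (outcome) history register
        self.pht = [1] * (1 << index_bits)   # 2-bit counters, start weakly not-taken

    def _index(self, pc):
        # XOR hashes history into the PC index, spreading out branches
        # that would otherwise alias on low-order PC bits alone.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2    # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

After a warmup period on an always-taken branch, the history register saturates, the same counter is reused, and the predictor settles on "taken."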
Analysis of how different branch behavior (whether mostly one-way) affects predictor performance and hybrid predictor designs based on the analysis.
Hybrid predictor. Unlike McFarling's, uses more than two predictors and focuses on the warmup problems introduced by context switching not examined in other papers.
Dynamically adjust the size of the GHR.
The bi-mode predictor, a gshare-like predictor with separate tables for branches biased (mostly) taken and branches biased not taken.
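The bi-mode organization described above can be sketched as follows (an illustrative simplification, not the paper's design; table sizes, counter initial values, and the exact choice-update rule are my assumptions): a choice table indexed by PC selects one of two gshare-style direction tables, so oppositely biased branches that alias land in different tables.

```python
class BiMode:
    """Minimal bi-mode predictor sketch."""
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.ghr = 0
        self.choice = [1] * (1 << bits)      # per-PC; >= 2 selects the taken-side table
        self.pht = [[1] * (1 << bits),       # direction table 0 (not-taken side)
                    [2] * (1 << bits)]       # direction table 1 (taken side)

    def _didx(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask   # gshare-style index

    def predict(self, pc):
        sel = 1 if self.choice[(pc >> 2) & self.mask] >= 2 else 0
        return self.pht[sel][self._didx(pc)] >= 2

    def update(self, pc, taken):
        ci = (pc >> 2) & self.mask
        sel = 1 if self.choice[ci] >= 2 else 0
        di = self._didx(pc)
        correct = (self.pht[sel][di] >= 2) == taken
        # Update only the selected direction counter.
        if taken:
            self.pht[sel][di] = min(3, self.pht[sel][di] + 1)
        else:
            self.pht[sel][di] = max(0, self.pht[sel][di] - 1)
        # Update the choice counter toward the outcome, except when the
        # choice disagreed with the outcome but the prediction was correct.
        choice_agrees = (self.choice[ci] >= 2) == taken
        if choice_agrees or not correct:
            if taken:
                self.choice[ci] = min(3, self.choice[ci] + 1)
            else:
                self.choice[ci] = max(0, self.choice[ci] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```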
YAGS Predictor. Like bi-mode, except PHTs also store tags and are only used if bimodal mispredicts.
The agree predictor: reduce destructive aliasing by predicting whether a branch's direction agrees with a stored per-branch bias, rather than predicting the direction itself, so that aliasing branches tend to update the shared counter in the same direction.
A gshare-like predictor with a dynamically adjusted outcome history size.
Two-level branch predictors, including what is called the global and local predictors in class. A widely cited paper, many later designs have been based on these. Simulations are performed for set-associative tables (tagless direct-mapped tables used elsewhere).
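The local (per-branch history) predictor discussed above can be sketched as follows (an illustrative simplification with assumed table sizes and initial values, not the paper's exact configuration): a first-level table holds a history register per branch, and that history indexes a second-level table of 2-bit counters, letting the predictor learn per-branch patterns such as strict alternation.

```python
class LocalPredictor:
    """Minimal two-level local (per-branch history) predictor sketch."""
    def __init__(self, hist_bits=4, entries=1024):
        self.hmask = (1 << hist_bits) - 1
        self.entries = entries
        self.bht = [0] * entries              # level 1: per-branch history registers
        self.pht = [1] * (1 << hist_bits)     # level 2: shared 2-bit counters

    def _bidx(self, pc):
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.pht[self.bht[self._bidx(pc)]] >= 2   # True = predict taken

    def update(self, pc, taken):
        b = self._bidx(pc)
        h = self.bht[b]
        if taken:
            self.pht[h] = min(3, self.pht[h] + 1)
        else:
            self.pht[h] = max(0, self.pht[h] - 1)
        self.bht[b] = ((h << 1) | int(taken)) & self.hmask
```

With a strictly alternating branch, the two recurring history values train two different counters, so after warmup the predictor follows the alternation perfectly, which a single counter could not do.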
History consists of target addresses rather than branch directions. An interesting idea, but not much better than existing schemes.
An early multiple branch predictor paper.
A front end using a superblock predictor. It can predict more than one basic block per cycle if those blocks contain highly biased not-taken branches. The contribution of this paper is the two-level design (not unlike a two-level cache) of the predictor and the queuing of predictions, not the predictor itself.
Description of the trace cache.
Description of a next-trace predictor. Predicts a trace ID, in effect predicting an instruction address and several following branch outcomes. Next-trace predictor uses several different techniques to get very high accuracy. Cost of predictor very high.
A survey of data prefetch techniques.
A complete implementation of hardware pre-execution (called pre-computation in the paper). Loads that spend time at the ROB head are pre-execution candidates. P-thread constructor looks for loop iterations (as opposed to a linear stream of instructions without regard to whether a static instruction appears multiple times).
A careful look at how pre-execution thread construction and selection methods affect performance. Selection is based on the latency hiding potential and overhead of threads individually and in groups. Study also examines impact of p-thread optimization. Emphasis is on studying selection techniques, not on realizable ways of doing so.
An implementation of hardware pre-execution which requires a profiling step to construct p-threads (unlike Collins which constructs p-threads in hardware). Both loads and branches are pre-executed and results from instructions in the p-thread are re-integrated into the main thread (unlike Collins).
A study providing evidence that pre-execution can work. The performance-degrading instructions are loads that frequently miss and branches that are frequently mispredicted. A backward slice is the set of instructions (except for CTI) needed to execute a performance-degrading instruction. The study shows that the backward slice is small (or sparse) enough to make pre-execution possible.
Pre-execution in which p-threads are launched by compiler-inserted instructions. Uses Itanium instructions.
Control independence study. Also available as a more complete technical report.
At a difficult-to-predict branch, start fetching at the reconvergence point, provided the hardware remembers the branch and its reconvergence point. Hardware-only implementation.
Multiscalar: A compiler chops up what would be a single thread of control into many small tasks. With the program in this form, multiscalar hardware can easily exploit control independence (and can perhaps more easily overlap instruction execution).
A big refinement of multiscalar in which scheduling of the tasks is based on predicted dependencies.
An implementation of eager execution in which execution proceeds down both directions of branches that a confidence estimator deems hard to predict.
Description of trace processor using several aggressive techniques, including value prediction. Includes comparison with a higher cost system: a superscalar processor with similar prediction capabilities and issue bandwidth.
Description of simultaneous multithreading (SMT), something like what Intel calls hyperthreading. Analysis of how variations in fetch and thread scheduling mechanisms affect performance. Simulations use multiprocessing workload (each thread from a different program).
Primary. Comparison of SMT with a single-chip multiprocessor having roughly equivalent total resources. Looks at resource use inefficiency in multiprocessor and impact of SMT on memory system, and other effects. Simulations use parallel workload (threads all part of one program).
A description of rePlay. First find frequently taken paths through the program, then produce an optimized version of that path and cache it for next time.
Optimize trace before storing it in trace cache. Optimization done on smaller sections of code than rePlay (above).
Divide an ordinary program into ≈ 128-instruction tasks. Using extreme optimizations (foolish anywhere else), distill from the original program a second program that can predict register values needed by the tasks. Run the distilled program on one core and the tasks on the remaining cores. Tasks verify inputs provided by the distilled code and may produce other architectural state not needed by the next few succeeding tasks.
Using value prediction to compute faster than the dataflow limit. Paper looks at prediction accuracy, but does not examine speedup.
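The value-prediction idea above can be illustrated with a minimal last-value predictor (an assumed simplification for illustration; real designs add stride prediction and larger confidence mechanisms): predict that an instruction will produce the same value it produced last time, and predict at all only once a small confidence counter saturates.

```python
class LastValuePredictor:
    """Minimal last-value predictor sketch with a 2-bit confidence counter."""
    def __init__(self):
        self.table = {}   # pc -> (last value, confidence 0..3)

    def predict(self, pc):
        v, conf = self.table.get(pc, (None, 0))
        return v if conf >= 2 else None   # None = no prediction made

    def update(self, pc, value):
        v, conf = self.table.get(pc, (None, 0))
        # Confidence grows while the value repeats, resets on a change.
        conf = min(3, conf + 1) if v == value else 0
        self.table[pc] = (value, conf)
```

A correct value prediction lets a dependent instruction execute before its producer, which is how value prediction can beat the dataflow limit; a misprediction requires a repair mechanism much like a branch misprediction.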
An examination of re-use.
Paper presents a trace re-use technique and presents a performance comparison of several forms of computation re-use: at the instruction, block, and trace level. Comparison is on a system with an unrealistic cache, especially when compared to the re-use tables. But, it's a comparison.
Required. Description of the R10000 implementation of the MIPS architecture.
Description of the 21264 implementation of the Alpha architecture.
Recommended. User manual for Shade, a simulation tool needed for some homework assignments.
Introduction to Shade, Sun Microsystems, Jun 25, 1997. (88.6 kB PDF)
Reference. Manpages for shade.
Shade User's Manual, V5.33A, Sun Microsystems, 1998. (85.4 kB PDF)
Recommended. Information on Shade and overview of instrumentation and simulation techniques.
Recommended. Description of instrumentation by random sampling using event counters.
Recommended. Description of SimOS, a processor simulator, and some sample applications.
Secondary. An annotated listing of simulation and instrumentation systems. A useful place to find further information or to get a feel for the range of systems.
Recommended. Evaluation of several branch prediction confidence estimators.
Recommended. Evaluation of best-case performance of several data prediction schemes.
Secondary. Study of performance limits of superscalar processors, used for the Hennessy & Patterson computer architecture text, section 4.7.
David W. Wall, “Limits of Instruction-Level Parallelism,” Technical Report, Digital WRL-93/6, November 1993. (464 kB PDF)
Recommended. Predicting load/store dependencies using store sets. Includes performance of systems that predict all pairs dependent and no pairs dependent.
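The store-set mechanism described above can be sketched roughly as follows (a simplification of the scheme; the table organization, ID allocation, and merge handling here are my assumptions): loads and stores that have conflicted in the past are assigned a common store set ID, and a load is made to wait for the most recently fetched, not-yet-issued store in its set.

```python
class StoreSets:
    """Minimal store-set memory dependence predictor sketch."""
    def __init__(self, entries=256):
        self.entries = entries
        self.ssit = [None] * entries   # store set ID table: PC index -> SSID
        self.next_ssid = 0
        self.lfst = {}                 # last fetched store table: SSID -> inst number

    def _idx(self, pc):
        return (pc >> 2) % self.entries

    def violation(self, load_pc, store_pc):
        # A load executed before an older store to the same address:
        # place both instructions in the same store set.
        ssid = self.ssit[self._idx(store_pc)]
        if ssid is None:
            ssid = self.next_ssid
            self.next_ssid += 1
        self.ssit[self._idx(load_pc)] = ssid
        self.ssit[self._idx(store_pc)] = ssid

    def fetch_store(self, store_pc, inum):
        ssid = self.ssit[self._idx(store_pc)]
        if ssid is not None:
            self.lfst[ssid] = inum

    def load_must_wait_for(self, load_pc):
        # Instruction number the load should wait on, or None if it may issue.
        ssid = self.ssit[self._idx(load_pc)]
        return None if ssid is None else self.lfst.get(ssid)

    def store_issued(self, store_pc, inum):
        ssid = self.ssit[self._idx(store_pc)]
        if ssid is not None and self.lfst.get(ssid) == inum:
            del self.lfst[ssid]
```

Until a violation is observed, loads issue freely (the "no pairs dependent" behavior); after one, only loads in the offending store's set are delayed, avoiding the cost of treating all pairs as dependent.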
Secondary. Forwarding data from store to load if dependency predicted.
Primary. Description of simultaneous multithreading (SMT). Analysis of how variations in fetch and thread scheduling mechanisms affect performance. Simulations use multiprocessing workload (each thread from a different program).
Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm, “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” in Proceedings of the International Symposium on Computer Architecture, May 1996, pp. 191-202.
Primary. Comparison of SMT with a single-chip multiprocessor having roughly equivalent total resources. Looks at resource use inefficiency in multiprocessor and impact of SMT on memory system, and other effects. Simulations use parallel workload (threads all part of one program).
Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen, “Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading,” ACM Transactions on Computer Systems, vol. 15, no. 3, pp. 322-354, August 1997.
Primary. Description of the Tera multithreaded processor.
Secondary. Short Tera program example and description of system software.
Secondary. Company web site. Some additional technical information and information about the company.
Tera (html)
David M. Koppelman - koppel@ece.lsu.edu | Modified 24 Apr 2007 15:25 (20:25 UTC) |