LSU EE 7722 - Accelerator Descriptions

Descriptions of accelerators or ideas for accelerators, past, present, and future. The papers and other items linked to this page will be used in class.

Modern GPU Descriptions

Tensor Processing / Neural Network Accelerator Descriptions

Xeon Phi Accelerator Descriptions

Proposed Accelerator Microarchitecture Features

Proposed GPU / Accelerator Designs

Older GPU Descriptions

Modern GPU Descriptions

NVIDIA Hopper Generation (CC 9.0)

NVIDIA Hopper (CC 9.0) Architecture

A description of the Hopper architecture, CC 9.0.

NVIDIA H100 Tensor Core GPU Architecture (v1.02) 2022 (7.88 MB PDF)

NVIDIA Ampere Generation (CC 8.0)

NVIDIA Ampere (CC 8.0) Architecture

A description of the Ampere architecture, CC 8.0.

NVIDIA A100 Tensor Core GPU Architecture (v1.0) 2020 (7.70 MB PDF)

NVIDIA Turing Generation (CC 7.5)

NVIDIA Turing (CC 7.5) Architecture

A description of the Turing architecture, CC 7.5, and the initial TU102 implementation. In addition to Volta's memory system differences, Turing adds support for real-time ray tracing and low-precision integer tensor unit operations. [Why is it that the scientific computing community refuses to downgrade from 64-bit floating point, while the ML community has no problems making low precision operations work?]

NVIDIA Turing GPU Architecture (2018) (16.8 MB PDF)

NVIDIA Volta Generation (CC 7.0)

NVIDIA Volta V100 (CC 7.0) Architecture, GV100 Implementation

A description of the Volta architecture, CC 7.0, and the initial GV100 implementation. There are several big differences with the previous Pascal generation, including tensor cores to support DNN machine learning workloads, an L1 cache (not just read-only), a better global memory model, and better thread synchronization.

NVIDIA Tesla V100 (August 2017) (3.86 MB PDF)

NVIDIA Pascal Generation (CC 6.x)

NVIDIA Pascal (CC 6.x) Architecture, GP100 Implementation, and Tesla P100 Accelerator

A description of the Pascal architecture, CC 6.x (actually just 6.0), the initial GP100 implementation, and the Tesla P100 accelerator board. Double precision is back! CC 6.0 devices can issue double-precision instructions at half the single-precision rate, and half-precision instructions at twice the single-precision rate. Attention 99%: don't rush to Best Buy, the initial offering (the P100) has an MSRP of $129,000.

NVIDIA Tesla P100 (Version 01) (2.69 MB PDF)
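The issue-rate ratios in the Pascal description above translate directly into peak throughput ratios. The sketch below works through the arithmetic. The lane count, the clock frequency, and the convention of counting a fused multiply-add as two floating-point operations are assumptions chosen for illustration, not figures from the description above. The function name `peak_tflops` is likewise hypothetical.

```python
# Assumed, illustrative device parameters (not taken from the text above).
SP_LANES = 3584      # single-precision functional units on the device
CLOCK_GHZ = 1.48     # clock frequency in GHz

# Issue rates relative to single precision, as stated for CC 6.0:
# double precision at half the SP rate, half precision at twice the SP rate.
RATE_RELATIVE_TO_SP = {"DP": 0.5, "SP": 1.0, "HP": 2.0}


def peak_tflops(sp_lanes, clock_ghz, rate_relative_to_sp):
    """Peak TFLOPS, assuming each lane completes one FMA (2 FLOPs) per cycle."""
    gflops = 2 * sp_lanes * clock_ghz * rate_relative_to_sp
    return gflops / 1000.0


peak = {
    precision: peak_tflops(SP_LANES, CLOCK_GHZ, rate)
    for precision, rate in RATE_RELATIVE_TO_SP.items()
}

for precision, tflops in peak.items():
    print(f"{precision}: {tflops:.2f} TFLOPS peak")
```

Whatever values are assumed for the lane count and clock, the DP peak stays at half the SP peak and the HP peak at twice the SP peak, since only the rate factor differs.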

NVIDIA Maxwell Generation (CC 5.x)

NVIDIA Maxwell GM204 and GeForce GTX 980

A detailed description of the GM204 chip and the GTX 980 board, one of the first Maxwell GPUs, implementing CC 5.2 and intended for consumer use. In some ways Maxwell is a simplified version of Kepler: the number of SP functional units per SM was reduced from 192 to 128, removing the annoying case of 1.5 warps per cycle per scheduler. Also, shared memory now has its own storage rather than sharing storage with the L1 cache.

NVIDIA, GeForce GTX 980, NVIDIA whitepaper. (2.24 MB PDF)
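The 1.5 warps per cycle per scheduler figure in the Maxwell description above follows from simple arithmetic. This is a minimal sketch, assuming a warp size of 32 threads and four warp schedulers per SM in both generations. Neither number is stated in the description above, and the function name is made up for illustration.

```python
from fractions import Fraction

WARP_SIZE = 32         # threads per warp (assumed)
SCHEDULERS_PER_SM = 4  # warp schedulers per SM (assumed for Kepler and Maxwell)


def warps_per_cycle_per_scheduler(sp_units_per_sm):
    """SP warp instructions each scheduler must issue per cycle
    to keep every SP functional unit in the SM busy."""
    return Fraction(sp_units_per_sm, WARP_SIZE * SCHEDULERS_PER_SM)


kepler = warps_per_cycle_per_scheduler(192)   # Kepler: 192 SP units per SM
maxwell = warps_per_cycle_per_scheduler(128)  # Maxwell: 128 SP units per SM

print("Kepler: ", float(kepler), "warps per cycle per scheduler")
print("Maxwell:", float(maxwell), "warps per cycle per scheduler")
```

Under these assumptions a Kepler scheduler would need to issue a fractional number of SP warp instructions per cycle to saturate the SM, while a Maxwell scheduler needs exactly one.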

NVIDIA Kepler Generation (CC 3.x)

NVIDIA Kepler GK110

A description of an NVIDIA Kepler GPU meant for scientific computation, one that implements CC 3.5. Differences with the Fermi (CC 2.x) generation are emphasized.

NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110," whitepaper V1.0. (1.76 MB PDF)

NVIDIA Kepler GK104 and GeForce GTX 680

A detailed description of the GK104 chip and GTX 680, one of the first Kepler GPUs, implementing CC 3.0 and intended for consumer use. Full of useful facts, descriptions, and comparisons. The marketing and legal teams must have been on vacation when it was time to review it.

NVIDIA, GeForce GTX 680, NVIDIA whitepaper. (1.44 MB PDF)

NVIDIA Fermi Generation (CC 2.x)

NVIDIA Fermi

A reasonably technical description of Fermi, the first big change in NVIDIA GPU architecture since the G80. This whitepaper emphasizes non-graphical applications.

NVIDIA, “NVIDIA's Next Generation CUDA Compute Architecture: Fermi,” NVIDIA Whitepaper V1.1, 2010. (877 kB PDF)

NVIDIA GF100

Another reasonably technical description of Fermi, with more emphasis on its use for graphics.

NVIDIA, “NVIDIA GF100, World's Fastest GPU Delivering Great Gaming Performance with True Geometric Realism,” NVIDIA Whitepaper V1.5. (1.08 MB PDF)

ATI GPUs

ATI R5xx Series

A detailed description of the CPU/GPU interface, intended for device driver writers. Could use some editing.

AMD, "Radeon R5xx Acceleration," manufacturer documentation. (3.13 MB PDF)

Tensor Processing / Neural Network Accelerator Descriptions

Production Accelerators

Design Considerations for Google's Tensor Processing Unit

A description of Google's TPU accelerator designed for inference workloads in Google's data centers. The paper looks at how expected workload (mostly FC and RNN layers, few CNN layers), reliably low-latency requirements (okay Google, the impatient thank you), and design time constraints affected its design. Its performance is compared with CPUs and GPUs, and some design-space exploration results are presented.

Jouppi, Norman P. and Young, Cliff and Patil, Nishant and Patterson, David and Agrawal, Gaurav and Bajwa, Raminder and Bates, Sarah and Bhatia, Suresh and Boden, Nan and Borchers, Al and Boyle, Rick and Cantin, Pierre-luc and Chao, Clifford and Clark, Chris and Coriell, Jeremy and Daley, Mike and Dau, Matt and Dean, Jeffrey and Gelb, Ben and Ghaemmaghami, Tara Vazir and Gottipati, Rajendra and Gulland, William and Hagmann, Robert and Ho, C. Richard and Hogberg, Doug and Hu, John and Hundt, Robert and Hurt, Dan and Ibarz, Julian and Jaffey, Aaron and Jaworski, Alek and Kaplan, Alexander and Khaitan, Harshit and Killebrew, Daniel and Koch, Andy and Kumar, Naveen and Lacy, Steve and Laudon, James and Law, James and Le, Diemthu and Leary, Chris and Liu, Zhuyuan and Lucke, Kyle and Lundin, Alan and MacKean, Gordon and Maggiore, Adriana and Mahony, Maire and Miller, Kieran and Nagarajan, Rahul and Narayanaswami, Ravi and Ni, Ray and Nix, Kathy and Norrie, Thomas and Omernick, Mark and Penukonda, Narayana and Phelps, Andy and Ross, Jonathan and Ross, Matt and Salek, Amir and Samadiani, Emad and Severn, Chris and Sizikov, Gregory and Snelham, Matthew and Souter, Jed and Steinberg, Dan and Swing, Andy and Tan, Mercedes and Thorson, Gregory and Tian, Bo and Toma, Horia and Tuttle, Erick and Vasudevan, Vijay and Walter, Richard and Wang, Walter and Wilcox, Eric and Yoon, Doe Hyun, “In-Datacenter Performance Analysis of a Tensor Processing Unit”, in the Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 1-12.

Xeon Phi Accelerator Descriptions

Xeon Phi Overview and Detailed Microarchitecture Description

The book provides a technically detailed overview of the Phi. See Chapters 3 and 4 for a description of the core microarchitecture.

Rezaur Rahman and Leonardo Borges, “Intel Xeon Phi Coprocessor Architecture and Tools,” May 2013. (6.67 MB PDF)

Xeon Phi Instruction Set Architecture

Intel's reference for the Xeon Phi ISA.

Intel, “Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual,” September 2012. (2.37 MB PDF)

Description of Intel's Larrabee GPU

A description of what Intel hoped would be the next step in GPU evolution. It was originally aimed at both graphics and non-graphical applications, but the first product (Intel Xeon Phi) targets only non-graphical applications. Though this is an early description, it is detailed and many important features have not changed.

Seiler, et al, “Larrabee: A many-core x86 architecture for visual computing,” ACM Transactions on Graphics, vol. 27, no. 3, article 18, August 2008. (2.21 MB PDF)

Proposed Accelerator Microarchitecture Features

A technique to avoid some “wasted” storage in NVIDIA GPU designs.

NVIDIA GPUs provide a set of registers for each thread, while striving to keep the threads of a warp at the same point in program execution. A register can hold values that are identical for each thread in a warp (for example, loop indices of loops that start at zero), or values that differ between threads by some constant times the lane number. The paper analyzes microarchitectural mechanisms to exploit such value structure in Fermi-like GPU designs.

Ji Kim, Christopher Torng, Shreesha Srinath, Derek Lockhart, and Christopher Batten, “Microarchitectural Mechanisms to Exploit Value Structure in SIMT Architectures”, International Symposium on Computer Architecture, 2013. (836 kB PDF)
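The two kinds of value structure described above can be illustrated by classifying the per-lane contents of a single register. This is only a sketch of the idea, not the mechanism proposed in the paper. The labels "uniform", "affine", and "generic", the function name, and the sample values are all assumptions made for illustration.

```python
WARP_SIZE = 32  # threads (lanes) per warp (assumed)


def classify_warp_values(values):
    """Classify one register's per-lane values across a warp.

    'uniform': every lane holds the same value.
    'affine' : value == base + stride * lane for a constant nonzero stride.
    'generic': anything else.
    Returns (label, base, stride); base and stride are None for 'generic'.
    """
    base = values[0]
    if all(v == base for v in values):
        return ("uniform", base, 0)
    stride = values[1] - values[0]
    if all(v == base + stride * lane for lane, v in enumerate(values)):
        return ("affine", base, stride)
    return ("generic", None, None)


lanes = range(WARP_SIZE)
loop_index = [7 for _ in lanes]                       # same in every lane
element_addr = [0x1000 + 4 * lane for lane in lanes]  # constant times lane number
loaded_data = [(lane * lane) % 11 for lane in lanes]  # no simple structure

for name, reg in [("loop_index", loop_index),
                  ("element_addr", element_addr),
                  ("loaded_data", loaded_data)]:
    print(name, classify_warp_values(reg))
```

A uniform or affine register is fully described by a base and a stride, so in principle it needs far less storage than one value per lane, which is the "wasted" storage the heading refers to.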

Proposed GPU / Accelerator Designs

Description of Echelon, NVIDIA's proposed hybrid CPU/GPU design.

This paper describes a proposed future hybrid CPU/GPU design, Echelon. In this design energy is just as much a resource constraint as chip area and pin bandwidth. The chip includes both CPU and GPU cores, as is already the case in some commercial products, but here the GPU cores are much more flexible than Kepler-generation cores. For additional information see Steve Keckler's keynote address slides from IISWC 2011.

Keckler, Dally, Khailany, Garland, and Glasco, “GPUs and the Future of Parallel Computing,” IEEE Micro, vol. 31, no. 5, September 2011, pp. 7-17. (662 kB PDF)

Description of Echelon, NVIDIA's proposed hybrid CPU/GPU design.

Another look at the Echelon design, evaluating it in terms of some DoD proxy applications. The design is similar to the one proposed in the 2011 paper, with minor differences such as the elimination (or downplaying) of spatial warps.

Oreste Villa, Daniel R. Johnson, Mike O’Connor, Evgeny Bolotin, David Nellans, Justin Luitjens, Nikolai Sakharnykh, Peng Wang, Paulius Micikevicius, Anthony Scudiero, Stephen W. Keckler and William J. Dally, "Scaling the Power Wall: A Path to Exascale", SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 2014, pp. 830-841. (555 kB PDF)

Older GPU Descriptions

NVIDIA GeForce 8800 GPU

Relatively little detail but does cover all major parts of the processor. More information can be gleaned from the CUDA documentation. Also see the GeForce-8800-specific OpenGL extensions.

NVIDIA, “NVIDIA GeForce 8800 GPU Architecture Overview,” NVIDIA Technical Brief TB-02787-001_v01, November 2006. (3.73 MB PDF)

NVIDIA GeForce 6800 GPU

A reasonably detailed description.

John Montrym and Henry Moreton, “The GeForce 6800,” IEEE Micro Magazine, vol. 25, no. 2, March 2005, pp. 41-51. (568 kB PDF)

David M. Koppelman - koppel@ece.lsu.edu	Modified 4 Nov 2022 18:03 (2303 UTC)