EE 7722 - Accelerator Descriptions

Descriptions of accelerators or ideas for accelerators, past, present, and future. The papers and other items linked to this page will be used in class.

Modern GPU Descriptions

NVIDIA Turing Generation (CC 7.5)

NVIDIA Turing (CC 7.5) Architecture
A description of the Turing architecture, CC 7.5, and the initial TU102 implementation. In addition to retaining Volta's memory-system improvements, Turing adds support for real-time ray tracing and low-precision integer tensor-unit operations. [Why is it that the scientific computing community refuses to downgrade from 64-bit floating point, while the ML community has no problems making low-precision operations work?]
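The bracketed question has a quantitative side. The sketch below (mine, not from the whitepaper) shows one reason low precision is risky: naive accumulation in fp16 stagnates once the running sum grows large enough that each addend falls below the representable precision. ML workloads tolerate this kind of noise (and mitigate it with wider accumulators); many scientific codes cannot.

```python
import numpy as np

# Illustrative: sum 10,000 copies of 0.01 in fp16 vs fp64.
# In fp16, once the sum is large enough, adding 0.01 rounds to
# adding zero and the sum stagnates far below the true value.
n = 10_000
s16 = np.float16(0.0)
for _ in range(n):
    s16 = np.float16(s16 + np.float16(0.01))
s64 = np.float64(0.01) * n

print(float(s16))  # stagnates well short of 100
print(s64)         # close to 100, as expected
```

This is exactly why tensor units typically take low-precision inputs but accumulate in a wider format.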

NVIDIA Volta Generation (CC 7.0)

NVIDIA Volta V100 (CC 7.0) Architecture, GV100 Implementation
A description of the Volta architecture, CC 7.0, and the initial GV100 implementation. There are several big differences from the previous Pascal generation, including tensor cores to support DNN machine learning workloads, a true L1 data cache (not just a read-only cache), a better global memory model, and better thread synchronization.
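To make the tensor-core primitive concrete, here is a numerically equivalent NumPy sketch of the fused operation D = A*B + C with fp16 inputs and fp32 accumulation. The sizes and variable names are illustrative only; this is not the CUDA WMMA API, which operates on larger warp-level fragments.

```python
import numpy as np

# Sketch of the tensor-core step D = A*B + C: fp16 operands,
# products accumulated in fp32. Sizes here are illustrative.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# Widen to fp32 before multiply-accumulate, as the hardware does.
D = C + A.astype(np.float32) @ B.astype(np.float32)
```

The fp32 accumulator is the key design point: it keeps long dot products from losing precision even though the stored operands are fp16.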

NVIDIA Pascal Generation (CC 6.x)

NVIDIA Pascal (CC 6.x) Architecture, GP100 Implementation, and Tesla P100 Accelerator
A description of the Pascal architecture, CC 6.x (actually just 6.0), the initial GP100 implementation, and the Tesla P100 accelerator board. Double precision is back! CC 6.0 devices can issue DP instructions at half the rate of SP, and half-precision at twice the rate of SP. Attention, 99%: don't rush to Best Buy; the initial offering (the P100) has an MSRP of $129,000.

NVIDIA Maxwell Generation (CC 5.x)

NVIDIA Maxwell GM204 and GeForce GTX 980
A detailed description of the GM204 chip and the GTX 980 board, one of the first Maxwell GPUs, implementing CC 5.0 and intended for consumer use. In some ways Maxwell is a simplified version of Kepler: the number of SP functional units per SM was reduced from 192 to 128, eliminating the awkward case of 1.5 warps per cycle per scheduler (192 lanes / 4 schedulers = 48 lanes, or 1.5 warps of 32 threads). Also, shared memory now has its own storage; it no longer shares storage with the L1 cache.

NVIDIA Kepler Generation (CC 3.x)

NVIDIA Kepler GK110
A description of an NVIDIA Kepler GPU meant for scientific computation, one that implements CC 3.5. Differences with the Fermi (CC 2.x) generation are emphasized.
NVIDIA Kepler GK104 and GeForce GTX 680
A detailed description of the GK104 chip and GTX 680, one of the first Kepler GPUs, implementing CC 3.0 and intended for consumer use. Full of useful facts, descriptions, and comparisons. The marketing and legal teams must have been on vacation when it was time to review it.

NVIDIA Fermi Generation (CC 2.x)

A reasonably technical description of Fermi, the first big change in NVIDIA GPU architecture since the G80. This whitepaper emphasizes non-graphical applications.
Another reasonably technical description of Fermi, with more emphasis on its use for graphics.


ATI R5xx Series
A detailed description of the CPU/GPU interface, intended for device driver writers. Could use some editing.

Tensor Processing / Neural Network Accelerator Descriptions

Production Accelerators

Design Considerations for Google's Tensor Processing Unit
A description of Google's TPU accelerator, designed for inference workloads in Google's data centers. The paper looks at how the expected workload (mostly FC and RNN layers, few CNN layers), strict latency requirements (okay Google, the impatient thank you), and design-time constraints affected its design. Its performance is compared with CPUs and GPUs, and some design-space exploration results are presented.
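To make the TPU's inference arithmetic concrete, here is a hedged NumPy sketch of 8-bit quantized computation with 32-bit accumulation, the style of arithmetic the TPU's matrix unit performs. The quantization scheme shown (simple symmetric per-tensor scaling) is illustrative, not necessarily Google's exact recipe.

```python
import numpy as np

# Sketch (illustrative, not Google's code): quantize weights and
# activations to int8, accumulate products in int32, rescale to real.
rng = np.random.default_rng(1)
w = rng.standard_normal((8, 8)).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)

# Symmetric per-tensor scales mapping the max magnitude to 127.
scale_w = np.abs(w).max() / 127.0
scale_x = np.abs(x).max() / 127.0
qw = np.round(w / scale_w).astype(np.int8)
qx = np.round(x / scale_x).astype(np.int8)

# int32 accumulation of int8 products, as in the TPU matrix unit.
acc = qw.astype(np.int32) @ qx.astype(np.int32)
y = acc.astype(np.float32) * (scale_w * scale_x)

err = np.abs(y - w @ x).max()  # small quantization error remains
```

Inference tolerates the small error introduced by the 8-bit representation, which is what lets the TPU pack so many multipliers per unit area and watt.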

Xeon Phi Accelerator Descriptions

Xeon Phi overview and Detailed Microarchitecture Description
The book provides a technically detailed overview of the Phi. See Chapters 3 and 4 for a description of the core microarchitecture.
Xeon Phi Instruction Set Architecture
Intel's reference for the Xeon Phi ISA.
Description of Intel's Larrabee GPU
A description of what Intel hoped would be the next step in GPU evolution. It was originally aimed at both graphics and non-graphical applications, but the first product (the Intel Xeon Phi) targeted only non-graphical applications. Though this is an early description, it is detailed and many important features have not changed.

Proposed Accelerator Microarchitecture Features

A technique to avoid some “wasted” storage in NVIDIA GPU designs.
NVIDIA GPUs provide a set of registers for each thread, while striving to keep the threads of a warp at the same point in program execution. A register will often hold values that are identical across the threads of a warp (for example, the index of a loop that starts at zero) or that differ by some constant times the lane number. The paper analyzes microarchitectural features to exploit such value structure in Fermi-like GPU designs.
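A minimal sketch (my notation, not the paper's) of the value structure in question: a warp register whose lane i holds base + i*stride is "affine" and can be compressed to two scalars instead of 32 per-lane values, which is the storage the paper seeks to avoid wasting.

```python
# Sketch of affine warp-register values: lane i holds base + i*stride.
WARP_SIZE = 32

def affine_form(values):
    """Return (base, stride) if values[i] == base + i*stride, else None."""
    base = values[0]
    stride = values[1] - values[0] if len(values) > 1 else 0
    if all(v == base + i * stride for i, v in enumerate(values)):
        return (base, stride)
    return None

# A loop index broadcast to every lane: stride 0, one scalar suffices.
print(affine_form([7] * WARP_SIZE))                          # (7, 0)
# Per-lane word addresses base + 4*lane: storable as (100, 4).
print(affine_form([100 + 4 * i for i in range(WARP_SIZE)]))  # (100, 4)
```

In hardware the detection would of course be done at write time rather than by scanning lane values, but the compression opportunity is the same.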

Proposed GPU / Accelerator Designs

Description of Echelon, NVIDIA's proposed hybrid CPU/GPU design.
This paper describes a proposed future hybrid CPU/GPU design, Echelon. In this design energy is just as much a resource constraint as chip area and pin bandwidth. The chip includes both CPU and GPU cores, as is already the case in some commercial products, but here the GPU cores are much more flexible than Kepler-generation cores. For additional information see Steve Keckler's keynote address slides from IISWC 2011.
Description of Echelon, NVIDIA's proposed hybrid CPU/GPU design.
Another look at the Echelon design, evaluating it in terms of some DoD proxy applications. The design is similar to the one proposed in the 2011 paper, with minor differences such as the elimination (or downplaying) of spatial warps.

Older GPU Descriptions

NVIDIA GeForce 8800 GPU
Relatively little detail but does cover all major parts of the processor. More information can be gleaned from the CUDA documentation. Also see the GeForce-8800-specific OpenGL extensions.

David M. Koppelman -
Modified 22 Apr 2020 17:54 (2254 UTC)