Descriptions of accelerators or ideas for accelerators, past, present, and future. The papers and other items linked to this page will be used in class.
Modern GPU Descriptions
NVIDIA Hopper Generation (CC 9.0)
NVIDIA Hopper (CC 9.0) Architecture
A description of the Hopper architecture, CC 9.0.
NVIDIA H100 Tensor Core GPU Architecture (v1.02) 2022 (7.88 MB PDF)
NVIDIA Ampere Generation (CC 8.0)
NVIDIA Ampere (CC 8.0) Architecture
A description of the Ampere architecture, CC 8.0.
NVIDIA A100 Tensor Core GPU Architecture (v1.0) 2020 (7.70 MB PDF)
NVIDIA Turing Generation (CC 7.5)
NVIDIA Turing (CC 7.5) Architecture
A description of the Turing architecture, CC 7.5,
and the initial TU102 implementation. In addition to Volta's memory
system differences, Turing adds support for real-time ray tracing
and low-precision integer tensor unit operations. [Why is it that
the scientific computing community refuses to downgrade from 64-bit
floating point, while the ML community has no problems making
low precision operations work?]
NVIDIA Turing GPU Architecture (2018) (16.8 MB PDF)
NVIDIA Volta Generation (CC 7.0)
NVIDIA Volta V100 (CC 7.0) Architecture, GV100 Implementation
A description of the Volta architecture, CC 7.0,
and the initial GV100 implementation. There are several big
differences with the previous Pascal generation,
including tensor cores to support DNN machine learning workloads,
an L1 cache (not just read-only), a better global memory
model, and better thread synchronization.
NVIDIA Tesla V100 (August 2017) (3.86 MB PDF)
NVIDIA Pascal Generation (CC 6.x)
NVIDIA Pascal (CC 6.x) Architecture, GP100 Implementation, and Tesla P100 Accelerator
A description of the Pascal architecture, CC 6.x (actually just 6.0),
the initial GP100 implementation, and the Tesla P100 accelerator board.
Double-precision is
back! CC 6.0 devices can issue DP instructions at half the rate of SP, and
half-precision instructions at twice the rate of SP.
Attention 99%: don't rush to Best Buy, the initial offering (the P100) has
an MSRP of $129,000.
NVIDIA Tesla P100 (Version 01) (2.69 MB PDF)
NVIDIA Maxwell Generation (CC 5.x)
NVIDIA Maxwell GM204 and GeForce GTX 980
A detailed description of the GM204 chip and the GTX 980 board, one of
the first Maxwell GPUs, implementing CC 5.0 and intended for consumer
use. In some ways Maxwell is a simplified version of Kepler: The
number of SP functional units per SM was reduced from 192 to 128, removing
the annoying 1.5 warps per cycle per scheduler case. Also, shared
memory now has its own storage; it no longer shares storage
with the L1 cache.
NVIDIA, GeForce GTX 980, NVIDIA whitepaper. (2.24 MB PDF)
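The 1.5-warp figure mentioned above can be reproduced with a little arithmetic. The following is a minimal sketch, assuming four warp schedulers per SM on both Kepler and Maxwell and a warp size of 32 threads; the function name and its parameters are made up for illustration and do not come from the whitepaper.

```python
from fractions import Fraction

WARP_SIZE = 32  # threads per warp (assumed; standard on NVIDIA GPUs)


def warps_per_scheduler(sp_units_per_sm, schedulers_per_sm=4):
    """Warps each scheduler must issue per cycle to keep every SP unit busy.

    schedulers_per_sm=4 is an assumption of this sketch, not a figure
    taken from the page above.
    """
    return Fraction(sp_units_per_sm, WARP_SIZE * schedulers_per_sm)


kepler = warps_per_scheduler(192)   # 192 SP units per SM
maxwell = warps_per_scheduler(128)  # 128 SP units per SM

print("Kepler: ", float(kepler), "warps per cycle per scheduler")
print("Maxwell:", float(maxwell), "warps per cycle per scheduler")
```

Under these assumptions a Kepler scheduler has to cover one and a half warps' worth of SP units each cycle, while a Maxwell scheduler covers exactly one.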
NVIDIA Kepler Generation (CC 3.x)
NVIDIA Kepler GK110
A description of an NVIDIA Kepler GPU meant for scientific computation,
one that implements CC 3.5. Differences with the Fermi (CC 2.x)
generation are emphasized.
NVIDIA Kepler GK104 and GeForce GTX 680
A detailed description of the GK104 chip and GTX 680, one of the first
Kepler GPUs, implementing CC 3.0 and intended for consumer use. Full
of useful facts, descriptions, and comparisons. The marketing and
legal teams must have been on vacation when it was time to review it.
NVIDIA, GeForce GTX 680, NVIDIA whitepaper. (1.44 MB PDF)
NVIDIA Fermi Generation (CC 2.x)
NVIDIA Fermi
A reasonably technical description of Fermi, the first big change
in NVIDIA GPU architecture since the G80. This whitepaper emphasizes
non-graphical applications.
NVIDIA GF100
Another reasonably technical description of the Fermi, with
more emphasis on its use for graphics.
ATI GPUs
ATI R5xx Series
A detailed description of the CPU/GPU interface, intended for
device driver writers. Could use some editing.
Tensor Processing / Neural Network Accelerator Descriptions
Production Accelerators
Design Considerations for Google's Tensor Processing Unit
A description of Google's TPU accelerator designed for inference
workloads in Google's data centers. The paper looks at how the expected
workload (mostly FC and RNN layers, few CNN layers), the requirement for
reliably low latency (okay Google, the impatient thank you), and
design-time constraints affected its design. Its performance is
compared with CPUs and GPUs, and some design-space exploration results
are presented.
Xeon Phi Accelerator Descriptions
Xeon Phi Overview and Detailed Microarchitecture Description
The book provides a technically detailed overview of the Phi. See
Chapters 3 and 4 for a description of the core microarchitecture.
Xeon Phi Instruction Set Architecture
Intel's reference for the Xeon Phi ISA.
Description of Intel's Larrabee GPU
A description of what Intel hoped would be the next step in
GPU evolution. It was originally aimed at both graphics and
non-graphical applications, but the first product (Intel Xeon Phi)
targets only non-graphical applications. Though this is an early
description, it is detailed and many important features have
not changed.
Proposed Accelerator Microarchitecture Features
A technique to avoid some “wasted” storage in NVIDIA GPU designs.
NVIDIA GPUs provide a set of registers for each thread, while striving
to keep warps of threads at the same point in program execution.
A given register can hold values that are identical for
every thread in a warp (for example, the index of a loop that
starts at zero) or values that differ by some constant times the lane number.
The paper analyzes microarchitectural features to exploit
such value structure in Fermi-like GPU designs.
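To make the idea of value structure concrete, here is a minimal sketch that classifies the 32 per-lane values of one register as uniform (the same in every lane), affine (a base plus a constant times the lane number), or generic. The function name and the three labels are invented for this illustration and are not taken from the paper, which concerns hardware mechanisms rather than software checks.

```python
def classify_warp_register(values):
    """Classify one register's per-lane values across a warp.

    values: non-empty list with one integer per lane (lane 0 first).
    Returns ("uniform", base), ("affine", base, stride), or ("generic",).
    The labels are hypothetical names used only in this sketch.
    """
    base = values[0]
    # Uniform: every lane holds the same value, so one copy would suffice.
    if all(v == base for v in values):
        return ("uniform", base)
    # Affine: lane i holds base + stride * i, so two numbers would suffice.
    stride = values[1] - values[0]
    if all(v == base + stride * lane for lane, v in enumerate(values)):
        return ("affine", base, stride)
    # Anything else needs a separate value for each lane.
    return ("generic",)


lanes = range(32)
loop_index = [7 for _ in lanes]                    # same loop index in every lane
address = [0x1000 + 4 * lane for lane in lanes]    # base address + 4 * lane number
data = [(lane * lane) % 11 for lane in lanes]      # no simple structure

for name, reg in [("loop_index", loop_index), ("address", address), ("data", data)]:
    print(name, classify_warp_register(reg))
```

A uniform register could in principle be stored as one value rather than 32, and an affine register as a base and a stride, which is the kind of “wasted” storage the entry above refers to.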
Proposed GPU / Accelerator Designs
Description of Echelon, NVIDIA's proposed hybrid CPU/GPU design.
This paper describes a proposed future hybrid CPU/GPU design, Echelon.
In this design energy is just as much a resource constraint as
chip area and pin bandwidth. The chip includes both CPU and GPU
cores, as is already the case in some commercial products, but here
the GPU cores are much more flexible than Kepler-generation cores.
For additional information see Steve
Keckler's keynote address slides from IISWC 2011.
Description of Echelon, NVIDIA's proposed hybrid CPU/GPU design.
Another look at the Echelon design, evaluating it in terms of
some DoD proxy applications. The design is similar to the
one proposed in the 2011 paper, with minor differences such
as the elimination (or downplaying) of spatial warps.
Older GPU Descriptions
NVIDIA GeForce 8800 GPU
Relatively little detail but does cover all major parts of the
processor. More information can be gleaned from the CUDA documentation.
Also see the GeForce-8800-specific OpenGL extensions.
NVIDIA GeForce 6800 GPU
A reasonably detailed description.
David M. Koppelman - koppel@ece.lsu.edu | Modified 4 Nov 2022 18:03 (2303 UTC) |