EE 7722 - References

References for material covered in lectures and needed to complete the assignments. This page will be updated as the semester progresses.

When available, links are provided to full-text versions of the material. If a dialog pops up asking for a username and password, use “ee4720” as the username and use the password given in class. This does not apply to the ACM Digital Library material. Some of the material is copyrighted and requires a subscription (e.g., ACM Digital Library) or one-time payment for access. LSU has a site subscription to the ACM DL (as of this writing).

Graphics Processor APIs

CUDA-Related Documentation and Non-Graphical Programming

NVIDIA CUDA Programming Guide
CUDA is an API for using a GPU for computation; the computation might be part of a scientific or engineering simulation (the most common application). This programming guide describes CUDA itself and also provides details of some NVIDIA GPUs.
PTX, NVIDIA's GPU Pseudo Assembly Language
PTX is a pseudo assembly language (or intermediate language) generated by the CUDA compiler front end. Though it resembles assembly language it should not be used for optimization because PTX and native machine instructions have significant differences and because a substantial amount of optimization is performed by the later stages. PTX is converted to the true machine language of the GPU by other NVIDIA build tools.
A description of the instruction sets in the NVIDIA GT200 (CC 1.x), Fermi (CC 2.x), and Kepler (CC 3.x) GPUs.
PTX is a pseudo assembly language (or intermediate language) generated by the CUDA compiler. The language described in this reference is the true machine language; it should be used in place of PTX for understanding the GPU microarchitecture and for fine-tuning code. The reference describes the language itself and the tool for generating this assembly code from CUDA binaries (disassembling).


The OpenGL 4.2 Compatibility Profile Specification
A description of the OpenGL API for using GPU hardware. The phrase "compatibility profile" in the name refers to the version of the specification that includes API features that are not needed for modern GPUs but are preserved for software compatibility and perhaps as a learning aid. The core profile version of the specification, which won't be used in class, eliminates these API features.
NVIDIA OpenGL Extension Specifications
OpenGL extensions allow a GPU manufacturer to provide OpenGL access to new features without having to wait for a new version of the OpenGL language to be approved.

Shader Language APIs

OpenGL Shading Language 4.20
C-like language used to program GPUs.
Survey of Shader Languages
A good survey of shader language APIs, though a bit outdated. Also available from the XEngine project website.

GPU Algorithms

Algorithm Components and Comparisons

GPU v. CPU Comparison of Many Common Algorithms
A comparison of the performance of many common scientific, media, and other algorithms on an NVIDIA CC 1.3 GPU and an Intel i7 quad-core processor. Factors responsible for CPU or GPU advantage are explored. Though the title uses the attention-getting word "debunking", the paper concentrates on understanding how these codes execute.
GPU Algorithm Building-Blocks (Patterns)
An examination of GPU program building blocks (such as organizing data to avoid underutilized requests), and their effectiveness over several GPU generations.
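
The idea of an underutilized request can be illustrated with a small model. The sketch below is not taken from the paper; it assumes that a warp of 32 threads issues one load, that each thread reads a 4-byte value, and that memory is fetched in aligned 128-byte segments. The names `segments_touched` and `utilization` are hypothetical, and the request size is an assumption that varies by GPU generation.

```python
# Simplified model (an assumption, not a description of a specific GPU):
# one warp issues a load, and memory is fetched in aligned segments.
WARP_SIZE = 32        # threads per warp on NVIDIA GPUs
SEGMENT_BYTES = 128   # assumed request granularity
ELEM_BYTES = 4        # one 32-bit value per thread


def segments_touched(stride_elems, base_addr=0):
    """Count distinct segments touched when thread t loads the
    element at index t * stride_elems."""
    segs = set()
    for t in range(WARP_SIZE):
        addr = base_addr + t * stride_elems * ELEM_BYTES
        segs.add(addr // SEGMENT_BYTES)
    return len(segs)


def utilization(stride_elems):
    """Fraction of the fetched bytes that the warp actually uses."""
    needed = WARP_SIZE * ELEM_BYTES
    fetched = segments_touched(stride_elems) * SEGMENT_BYTES
    return needed / fetched


if __name__ == "__main__":
    # Stride 1: consecutive threads read consecutive elements.
    # Stride 4: e.g., one field of a four-float struct per thread.
    for stride in (1, 4, 32):
        print(stride, segments_touched(stride), utilization(stride))
```

Under these assumptions a stride of 1 uses every fetched byte, while a stride of 32 fetches 32 segments to deliver 128 useful bytes. Reorganizing data so that consecutive threads access consecutive elements is the kind of building block the description refers to.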


Matrix Multiplication, Factoring
Description of dense matrix multiplication and other algorithms implemented on CUDA for the NVIDIA G80 series. The paper starts with detailed benchmark results and uses these to determine the performance potential of the various matrix operations. The CC 1.x (Tesla-generation) GPU that they target is obsolete, but the methodology used in the paper, machine-code level analysis to maximize instruction density, is still valuable, as are the overall methods for data staging.
Sorting (Radix and Merge)
The paper describes the implementation of two sorting algorithms on NVIDIA GPUs, a radix sort and a merge sort. The implementations are carefully tuned to make the most efficient use of the GPU. (In a perfect world this could be said about every published algorithm.) The source code is part of the CUDA SDK and is presently available on the ECE computers in directory /home/classes/ee7700/com/nvidia-gpu-sdk/C/src/radixSort.
Single-Source Shortest Path [between graph nodes]
Implementations of SSSP (single-source shortest path) algorithms on Kepler-generation GPUs. Irregular work (because the number of edges incident to a vertex can vary greatly) and scattered data (there's no way to lay out a graph so that consecutive threads will always access consecutive items) are the challenges faced in GPU implementation.
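
The irregular-work problem can be seen in a minimal sketch. The code below is not the implementation from the paper; it is a sequential, frontier-based Bellman-Ford relaxation over a graph stored in compressed sparse row (CSR) form, written under the assumption that a GPU version would assign one thread to each frontier vertex. The function name `sssp_frontier` and the example graph are hypothetical.

```python
import math


def sssp_frontier(row_start, col, weight, source):
    """Frontier-based Bellman-Ford on a CSR graph.

    Returns (dist, work_per_round). work_per_round holds, for each round,
    the out-degree of every frontier vertex, which is the amount of work
    a one-thread-per-vertex GPU mapping would give each thread.
    """
    n = len(row_start) - 1
    dist = [math.inf] * n
    dist[source] = 0
    frontier = [source]
    work_per_round = []
    while frontier:
        work_per_round.append(
            [row_start[v + 1] - row_start[v] for v in frontier])
        next_frontier = set()
        for v in frontier:                    # one GPU thread per vertex
            for e in range(row_start[v], row_start[v + 1]):
                u = col[e]                    # scattered access into dist
                cand = dist[v] + weight[e]
                if cand < dist[u]:            # an atomicMin in a CUDA version
                    dist[u] = cand
                    next_frontier.add(u)
        frontier = sorted(next_frontier)
    return dist, work_per_round


if __name__ == "__main__":
    # Edges: 0->1 (1), 0->2 (4), 0->3 (7), 1->2 (2), 2->3 (1)
    row_start = [0, 3, 4, 5, 5]
    col = [1, 2, 3, 2, 3]
    weight = [1, 4, 7, 2, 1]
    dist, work = sssp_frontier(row_start, col, weight, 0)
    print(dist)   # shortest distances from vertex 0
    print(work)   # per-vertex edge counts in each round
```

In the example, vertex 0 has three outgoing edges while vertex 3 has none, so threads in the same round would do unequal amounts of work. The reads and writes of `dist[u]` follow the edge list rather than the thread index, which is the scattered-data challenge.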

Computer Architecture Topics Not Specific to Accelerators


Effectiveness of Software and Hardware Prefetching
A detailed analysis of the effectiveness of various software and hardware data prefetching schemes, alone and in combination, on the SPECcpu2006 benchmarks.
Prefetch Survey Paper
A survey of software and hardware data prefetch techniques. The material is well explained but is slightly dated (as of 2015).

David M. Koppelman -
Modified 22 Oct 2017 13:04 (1804 UTC)