EE 7722 - References

References for material covered in lectures and needed to complete the assignments. This page will be updated as the semester progresses.

When available, links are provided to full-text versions of the material. If a dialog pops up asking for a username and password, use “ee4720” as the username and the password given in class. This does not apply to the ACM Digital Library material. Some of the material is copyrighted and requires a subscription (e.g., ACM Digital Library) or one-time payment for access. LSU has a site subscription to the ACM DL (as of this writing).

Graphics Processor APIs

CUDA-Related Documentation and Non-Graphical Programming

NVIDIA CUDA Programming Guide
CUDA is a system for using NVIDIA GPUs for computation. The computation might be part of a scientific or engineering simulation (the most common application). This programming guide describes CUDA itself but also provides details of some NVIDIA GPUs.
PTX, NVIDIA's GPU Pseudo Assembly Language
PTX is an intermediate language (or pseudo assembly language) generated by the CUDA compiler front end. Though it resembles assembly language it should not be used for very tight optimization because PTX and native machine instructions can have significant differences and because a substantial amount of optimization is performed by the later stages. That said, PTX is useful when it is important that code use particular machine instructions. PTX is converted to the true machine language, informally called SASS, of the GPU by other NVIDIA build tools.
Disassembly Tools and GPU Instruction Set Description
PTX is a pseudo assembly language (or intermediate language) generated by the CUDA compiler. The languages described in this reference are the true machine languages. They should be used in place of PTX for understanding the GPU microarchitecture and for fine-tuning code. The reference describes the languages themselves and the tools for generating assembly code from CUDA binaries (disassembling).


The OpenGL 4.6 Compatibility Profile Specification
A description of the OpenGL API for using GPU hardware. The phrase “compatibility profile” in the name refers to the version of the specification that includes API features that are not needed for modern GPUs but are preserved for software compatibility and perhaps as a learning aid.
NVIDIA OpenGL Extension Specifications
OpenGL extensions allow a GPU manufacturer to provide OpenGL access to new features without having to wait for a new version of the OpenGL specification to be approved. The extensions in this reference are quite old; most have since been incorporated into OpenGL. Consider this a historical reference.

Shader Language APIs

OpenGL Shading Language 4.60.5
C-like language used to program GPUs. OpenGL Shading Language (OGSL) code is managed using the OpenGL API. OGSL code can implement parts of the rendering pipeline (such as a vertex shader) or can run independently of the rendering pipeline (as a compute shader).
Survey of Shader Languages
A good survey of shader language APIs, though a bit outdated. Also available from the XEngine project website.

GPU Algorithms

Algorithm Components and Comparisons

GPU v. CPU Comparison of Many Common Algorithms
A comparison of the performance of many common scientific, media, and other algorithms on an NVIDIA CC 1.3 GPU and an Intel i7 quad-core processor. Factors responsible for CPU or GPU advantage are explored. Though the title uses the attention-getting word "debunking," the paper concentrates on understanding how these codes execute.
GPU Algorithm Building-Blocks (Patterns)
An examination of GPU program building blocks (such as organizing data to avoid underutilized requests) and their effectiveness over several GPU generations.


Matrix Multiplication, Factoring
Description of dense matrix multiplication and other algorithms implemented on CUDA for the NVIDIA G80 series. The paper starts with detailed benchmark results and uses these to determine the performance potential of the various matrix operations. The CC 1.x (Tesla-generation) GPU that they target is obsolete, but the methodology used in the paper, machine-code-level analysis to maximize instruction density, is still valuable, as are the overall methods for data staging.
Sorting (Radix and Merge)
The paper describes the implementation of two sorting algorithms on NVIDIA GPUs, a radix sort and a merge sort. The implementations are carefully tuned to make the most efficient use of the GPU. (In a perfect world this could be said about every published algorithm.) The source code is part of the CUDA SDK and is presently available on the ECE computers in directory /home/classes/ee7700/com/nvidia-gpu-sdk/C/src/radixSort.
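For readers who have not seen radix sort, the following is a minimal sequential C++ sketch of a least-significant-digit radix sort. It is not the paper's implementation and is not tuned in any way; the function name radix_sort_sketch and the choice of 8 bits per pass are assumptions made for illustration. Each pass has three steps (histogram, exclusive prefix sum, stable scatter), and these are the steps that a parallel implementation has to distribute across threads.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, for illustration only: a sequential LSD radix sort
// on 32-bit unsigned keys, processing 8 bits (one "digit") per pass.
std::vector<std::uint32_t> radix_sort_sketch(std::vector<std::uint32_t> keys)
{
  constexpr int bits_per_pass = 8;  // Assumed digit size.
  constexpr std::size_t n_buckets = std::size_t(1) << bits_per_pass;
  std::vector<std::uint32_t> scratch(keys.size());

  for (int shift = 0; shift < 32; shift += bits_per_pass) {
    // Step 1: Histogram. Count the keys that fall in each bucket.
    std::array<std::size_t, n_buckets> count{};
    for (std::uint32_t k : keys) count[(k >> shift) & (n_buckets - 1)]++;

    // Step 2: Exclusive prefix sum. count[b] becomes the output
    // position of the first key in bucket b.
    std::size_t sum = 0;
    for (std::size_t b = 0; b < n_buckets; b++) {
      const std::size_t c = count[b];
      count[b] = sum;
      sum += c;
    }

    // Step 3: Stable scatter. Keys with equal digits keep their order,
    // which is what makes the digit-by-digit approach correct.
    for (std::uint32_t k : keys)
      scratch[count[(k >> shift) & (n_buckets - 1)]++] = k;

    keys.swap(scratch);
  }
  return keys;
}
```

Calling radix_sort_sketch on a vector of keys returns a sorted copy. The number of passes depends only on the key width, not on the key values.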
Single-Source Shortest Path [between graph nodes]
Implementations of SSSP (single-source shortest path) algorithms on Kepler-generation GPUs. Irregular work (because the number of edges incident to a vertex can vary greatly) and scattered data (there's no way to lay out a graph so that consecutive threads will always access consecutive items) are the challenges faced in GPU implementation.
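The two challenges named above can be seen even in a sequential sketch. The C++ code below is a plain Bellman-Ford-style relaxation and is not one of the paper's algorithms; the CsrGraph type, the function name sssp_sketch, and the assumption of non-negative edge weights are all introduced here for illustration. The inner loop's trip count varies with vertex degree (irregular work), and the writes to dist are indexed by edge destination (scattered data).

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical graph type, for illustration: compressed sparse row (CSR).
// The edges leaving vertex v are at positions row_start[v] .. row_start[v+1]-1.
struct CsrGraph {
  std::vector<std::size_t> row_start;  // Size: number of vertices + 1.
  std::vector<int> dst;                // Destination vertex of each edge.
  std::vector<int> weight;             // Edge weight, assumed non-negative.
};

// Sequential Bellman-Ford-style relaxation. Unreachable vertices keep
// the value std::numeric_limits<long>::max().
std::vector<long> sssp_sketch(const CsrGraph& g, int source)
{
  const std::size_t n = g.row_start.size() - 1;
  const long inf = std::numeric_limits<long>::max();
  std::vector<long> dist(n, inf);
  dist[source] = 0;

  bool changed = true;
  for (std::size_t round = 0; changed && round < n; round++) {
    changed = false;
    for (std::size_t v = 0; v < n; v++) {
      if (dist[v] == inf) continue;  // Not reached yet.
      // Irregular work: the trip count is the out-degree of v.
      for (std::size_t e = g.row_start[v]; e < g.row_start[v + 1]; e++) {
        const long cand = dist[v] + g.weight[e];
        // Scattered access: dst[e] can be any vertex.
        if (cand < dist[g.dst[e]]) {
          dist[g.dst[e]] = cand;
          changed = true;
        }
      }
    }
  }
  return dist;
}
```

If the loop over v were assigned to consecutive threads, neighboring threads would execute different numbers of iterations and would touch unrelated elements of dist.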

Computer Architecture Topics Not Specific to Accelerators


Effectiveness of Software and Hardware Prefetching
A detailed analysis of the effectiveness of various software and hardware data prefetching schemes, alone and in combination, on SPECcpu2006 benchmarks.
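As a reminder of what software prefetching looks like in source code, here is a minimal C++ sketch. It is not taken from the paper. The function name sum_indirect_sketch and the prefetch distance of 16 elements are assumptions chosen for illustration, and __builtin_prefetch is a GCC and Clang builtin, so the sketch compiles the prefetch out on other compilers. Indirect accesses such as data[index[i]] are a common target because simple stride-based hardware prefetchers cannot predict them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: sum data elements selected through an index array,
// issuing a software prefetch for the element needed several iterations ahead.
long sum_indirect_sketch(const std::vector<long>& data,
                         const std::vector<std::size_t>& index)
{
  constexpr std::size_t distance = 16;  // Assumed; a real value needs tuning.
  long sum = 0;
  for (std::size_t i = 0; i < index.size(); i++) {
#if defined(__GNUC__) || defined(__clang__)
    // Request the line holding a future element. The second argument (0)
    // indicates a read; the third (1) indicates low temporal locality.
    if (i + distance < index.size())
      __builtin_prefetch(&data[index[i + distance]], 0, 1);
#endif
    sum += data[index[i]];
  }
  return sum;
}
```

The prefetch is only a hint and does not change the result. Whether it helps depends on the prefetch distance, the memory latency, and what the hardware prefetcher is already doing.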
Prefetch Survey Paper
A survey of software and hardware data prefetch techniques. The material is well explained but is slightly dated (as of 2015).

David M. Koppelman
Modified 8 Jan 2020 17:53 (2353 UTC)