EE 7722 - References

References for material covered in lectures and needed to complete the assignments. This page will be updated as the semester progresses.

When available, links are provided to full-text versions of the material. If a dialog pops up asking for a username and password, use “ee4720” as the username and the password given in class. This does not apply to the ACM Digital Library material. Some of the material is copyrighted and requires a subscription (e.g., ACM Digital Library) or a one-time payment for access. LSU has a site subscription to the ACM DL (as of this writing).

Graphics Processor APIs

GPU Algorithms

Computer Architecture Topics Not Specific to Accelerators

Graphics Processor APIs

CUDA-Related Documentation and Non-Graphical Programming

NVIDIA CUDA Programming Guide

CUDA is a system for using NVIDIA GPUs for computation. The computation might be part of a scientific or engineering simulation (the most common application). This programming guide describes CUDA itself but also provides details of some NVIDIA GPUs.

NVIDIA, “CUDA C Programming Guide,” PG-02829-001_v10.0, October 2018. The link should lead to this version or a later one; if it does not, the link is stale.

PTX, NVIDIA's GPU Pseudo Assembly Language

PTX is an intermediate language (or pseudo assembly language) generated by the CUDA compiler front end. Though it resembles assembly language, it should not be used for very tight optimization, because PTX and native machine instructions can differ significantly and because the later compilation stages perform a substantial amount of optimization. That said, PTX is useful when it is important that code use particular machine instructions. Other NVIDIA build tools convert PTX to the GPU's true machine language, informally called SASS.

NVIDIA, “Parallel Thread Execution ISA, Version 6.3,” October 2018 or later.

Disassembly Tools and GPU Instruction Set Description

PTX is a pseudo assembly language (or intermediate language) generated by the CUDA compiler. The languages described in this reference are the true machine languages. They should be used in place of PTX for understanding the GPU microarchitecture and for fine-tuning code. The reference describes the languages themselves and the tools for generating assembly code from CUDA binaries (disassembling).

NVIDIA, “CUDA Binary Utilities,” DA-06762-001_v10.0, October 2018 or later.

OpenGL

The OpenGL 4.6 Compatibility Profile Specification

A description of the OpenGL API for using GPU hardware. The phrase “compatibility profile” in the name refers to the version of the specification that includes API features that are not needed for modern GPUs but are preserved for software compatibility and perhaps as a learning aid.

Mark Segal, Kurt Akeley, “The OpenGL Graphics System: A Specification (Version 4.6 (Compatibility Profile) - May 14, 2018)” (3.65 MB PDF)

NVIDIA OpenGL Extension Specifications

OpenGL extensions allow a GPU manufacturer to provide OpenGL access to new features without having to wait for a new version of the OpenGL specification to be approved. The extensions in this reference are quite old; most have since been incorporated into OpenGL. Consider this a historical reference.

NVIDIA OpenGL Extension Specifications, February 22, 2008. (6.68 MB PDF)

Shader Language APIs

OpenGL Shading Language 4.60.5

A C-like language used to program GPUs. OpenGL Shading Language (GLSL) code is managed using the OpenGL API. GLSL code can implement parts of the rendering pipeline (such as a vertex shader) or can run independently of the rendering pipeline (as a compute shader).

John Kessenich (Editor), David Baldwin and Randi Rost (Version 1.1 Authors) “The OpenGL Shading Language, Version 4.60.5” 14 June 2018 16:22:42 UTC (5.04 MB PDF)

Survey of Shader Languages

A good survey of shader language APIs, though a bit outdated. Also available from the XEngine project website.

Martin Ecker, "Programmable Graphics Pipeline Architectures", XEngine, March 2003. (373 kB PDF)

GPU Algorithms

Algorithm Components and Comparisons

GPU v. CPU Comparison of Many Common Algorithms

A comparison of the performance of many common scientific, media, and other algorithms on an NVIDIA CC 1.3 GPU and an Intel i7 quad-core processor. Factors responsible for CPU or GPU advantage are explored. Though the title uses the attention-getting word "debunking," the paper concentrates on understanding how these codes execute.

Lee, Victor W. and Kim, Changkyu and Chhugani, Jatin and Deisher, Michael and Kim, Daehyun and Nguyen, Anthony D. and Satish, Nadathur and Smelyanskiy, Mikhail and Chennupaty, Srinivas and Hammarlund, Per and Singhal, Ronak and Dubey, Pradeep, “Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU,” in the Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010, pp. 451-460. (482 kB PDF)

GPU Algorithm Building-Blocks (Patterns)

An examination of GPU program building blocks (such as organizing data to avoid underutilized requests), and their effectiveness over several GPU generations.

Stratton, J.A.; Anssari, N.; Rodrigues, C.; I-Jui Sung; Obeid, N.; Liwen Chang; Liu, G.D.; and Hwu, W., “Optimization and architecture effects on GPU computing workload performance,” Innovative Parallel Computing (InPar), 2012, pp. 1-10. (1.92 MB PDF)

Algorithms

Matrix Multiplication, Factoring

Description of dense matrix multiplication and other algorithms implemented on CUDA for the NVIDIA G80 series. The paper starts with detailed benchmark results and uses these to determine the performance potential of the various matrix operations. The CC 1.x (Tesla-generation) GPU that they target is obsolete, but the methodology used in the paper, machine-code-level analysis to maximize instruction density, is still valuable, as are the overall methods for data staging.

Vasily Volkov and James W. Demmel, “Benchmarking GPUs to Tune Dense Linear Algebra,” Supercomputing 2008. (649 kB PDF)
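
As a rough illustration of what "data staging" means here, the following is a minimal sketch of a tiled (blocked) matrix multiplication. It is plain sequential Python written for readability, not CUDA, and it is not the implementation from the paper; the function name matmul_tiled and the tile parameter are hypothetical. The idea it shows is that a small tile of each operand is copied into fast storage (shared memory or registers on a GPU, local lists here) and reused for many multiply-adds before the next tile is fetched.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply matrices a (n x k) and b (k x m), given as non-empty lists of rows.

    Illustrative only: the tile copies stand in for staging data in
    fast on-chip storage before it is reused.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of the result
        for j0 in range(0, m, tile):      # tile column of the result
            for k0 in range(0, k, tile):  # walk along the shared dimension
                # Stage one tile of each operand (slices clip at the edges).
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # Every staged element is reused up to `tile` times.
                for i, a_row in enumerate(a_tile):
                    for kk, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_tile[kk]):
                            c[i0 + i][j0 + j] += a_val * b_val
    return c
```

With tile set to t, each staged element takes part in up to t multiply-adds, so traffic to the slow memory drops by roughly that factor. This is the general motivation for staging, stated here as background rather than as a result from the paper.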

Sorting (Radix and Merge)

The paper describes the implementation of two sorting algorithms on NVIDIA GPUs, a radix sort and a merge sort. The implementations are carefully tuned to make the most efficient use of the GPU. (In a perfect world this could be said about every published algorithm.) The source code is part of the CUDA SDK and is presently available on the ECE computers in directory /home/classes/ee7700/com/nvidia-gpu-sdk/C/src/radixSort.

Nadathur Satish, Mark Harris, and Michael Garland, “Designing efficient sorting algorithms for manycore GPUs,” in the Proceedings of the IEEE International Parallel & Distributed Processing Symposium 2009. (891 kB PDF)
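
For readers new to radix sort, the following is a minimal sequential sketch of a least-significant-digit radix sort in plain Python. It is not the tuned GPU implementation from the paper; the function name radix_sort and its parameters are hypothetical. It does show the three steps performed in each pass (histogram, prefix sum, stable scatter), which are the steps a parallel implementation has to distribute across threads.

```python
def radix_sort(keys, bits_per_pass=4, key_bits=32):
    """Return a sorted copy of keys (non-negative integers below 2**key_bits)."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    for shift in range(0, key_bits, bits_per_pass):
        # Step 1: histogram of the current digit.
        counts = [0] * radix
        for key in keys:
            counts[(key >> shift) & mask] += 1
        # Step 2: exclusive prefix sum gives each digit's first output slot.
        offsets = [0] * radix
        total = 0
        for digit in range(radix):
            offsets[digit] = total
            total += counts[digit]
        # Step 3: stable scatter; keys with equal digits keep their order.
        out = [0] * len(keys)
        for key in keys:
            digit = (key >> shift) & mask
            out[offsets[digit]] = key
            offsets[digit] += 1
        keys = out
    return keys
```

The histogram and the scatter are each a simple loop over the keys, and the prefix sum is a standard parallel primitive, which is part of why radix sort is a natural candidate for GPUs.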

Single-Source Shortest Path [between graph nodes]

Implementations of SSSP (single-source shortest path) algorithms on Kepler-generation GPUs. Irregular work (because the number of edges incident to a vertex can vary greatly) and scattered data (there's no way to lay out a graph so that consecutive threads will always access consecutive items) are the challenges faced in GPU implementation.

Andrew Davidson, Sean Baxter, Michael Garland, John D. Owens, “Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths,” in the Proceedings of the IEEE International Parallel & Distributed Processing Symposium 2014. (744 kB PDF)
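
To make the two challenges concrete, the following is a minimal sequential sketch of a frontier-based Bellman-Ford relaxation in plain Python. It is a simplified baseline, not one of the methods from the paper; the function name sssp_frontier is hypothetical, and the graph is assumed to be in compressed sparse row (CSR) form with non-negative edge weights.

```python
import math

def sssp_frontier(row_offsets, col_indices, weights, source):
    """Shortest distances from source in a CSR graph (non-negative weights assumed).

    Edges leaving vertex u occupy positions row_offsets[u] up to, but not
    including, row_offsets[u + 1] in col_indices and weights.
    """
    n = len(row_offsets) - 1
    dist = [math.inf] * n
    dist[source] = 0
    frontier = [source]
    while frontier:
        next_frontier = set()
        # A GPU version would process frontier vertices (or their edges) in parallel.
        for u in frontier:
            # Irregular work: the trip count is the out-degree of u.
            for e in range(row_offsets[u], row_offsets[u + 1]):
                v = col_indices[e]          # Scattered data: v can be any vertex.
                candidate = dist[u] + weights[e]
                if candidate < dist[v]:
                    dist[v] = candidate
                    next_frontier.add(v)    # v improved, so relax its edges next.
        frontier = sorted(next_frontier)
    return dist
```

The inner loop length depends on the out-degree of u, and the accesses to dist[v] follow the graph structure rather than thread order. These are the irregular-work and scattered-data issues described above.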

Computer Architecture Topics Not Specific to Accelerators

Prefetching

Effectiveness of Software and Hardware Prefetching

A detailed analysis of the effectiveness of various software and hardware data prefetching schemes, alone and in combination, on SPECcpu2006 benchmarks.

Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, "When Prefetching Works, When It Doesn't, and Why," ACM TACO, vol. 9, no. 1, 2012, a2. (2.07 MB PDF)

Prefetch Survey Paper

A survey of software and hardware data prefetch techniques. The material is well explained but slightly dated (as of 2015).

Steven P. Vanderwiel and David J. Lilja, “Data prefetch mechanisms,” ACM Computing Surveys, vol. 32, no. 2, June 2000. (96.2 kB PDF)

David M. Koppelman - koppel@ece.lsu.edu	Modified 8 Jan 2020 17:53 (2353 UTC)