| Audience |
This tutorial is aimed primarily at application developers, computer and computational scientists, and graduate students interested in performance optimization and/or compilers for high-performance computing. Knowledge of C programming and basic knowledge of processor architectures are assumed; no prior parallel programming experience is required.
| Brief Description |
On-chip parallelism with multiple cores is now ubiquitous. Because of power and cooling constraints, recent performance improvements in both general-purpose and special-purpose processors have come primarily from increased on-chip parallelism rather than increased clock rates. Parallelism is therefore of considerable interest to a much broader group than developers of parallel applications for high-end supercomputers. Several programming environments have recently emerged in response to the need to develop applications for GPUs, the Cell processor, and multi-core processors from AMD, IBM, Intel, and others. As commodity computing platforms all go parallel, programming them to attain high performance has become an extremely important issue. There has been considerable recent interest in two complementary approaches:
- developing programming models that explicitly expose the programmer to parallelism; and
- compiler optimization frameworks to automatically transform sequential programs for parallel execution.
This tutorial will provide an introductory survey covering both these aspects. In contrast to conventional multicore architectures, GPUs and the Cell processor have to exploit parallelism while managing the physical memory on the processor (since there are no caches) by explicitly orchestrating the movement of data between large off-chip memories and the limited on-chip memory. This tutorial will address the issue of explicit memory management in detail.
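To make the idea of explicit memory management concrete, here is a minimal sketch in plain C of the copy-in, compute, copy-out pattern that software-managed memories require. It is an illustration only, not code from the tutorial: the function name `scale_staged`, the buffer size `TILE`, and the use of an ordinary local array as a stand-in for on-chip memory are all assumptions. On real hardware the two `memcpy` calls would be replaced by the platform's own transfer mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed capacity (in floats) of the small "on-chip" buffer. */
#define TILE 64

/*
 * Scale every element of data[0..n) by factor, one tile at a time.
 * The array `local` stands in for a scratchpad or local store; the
 * large array `data` stands in for off-chip memory. All names here
 * are hypothetical and chosen for illustration.
 */
void scale_staged(float *data, int n, float factor)
{
    float local[TILE];

    assert(data != NULL && n >= 0);

    for (int base = 0; base < n; base += TILE) {
        /* The last tile may be shorter than TILE. */
        int len = (n - base < TILE) ? (n - base) : TILE;

        /* 1. Copy in: move one tile from the large memory to the small one. */
        memcpy(local, data + base, (size_t)len * sizeof(float));

        /* 2. Compute: touch only the small, fast buffer. */
        for (int i = 0; i < len; i++) {
            local[i] *= factor;
        }

        /* 3. Copy out: write the results back to the large memory. */
        memcpy(data + base, local, (size_t)len * sizeof(float));
    }
}
```

Calling `scale_staged(a, 1000, 2.0f)` on an array holding 0, 1, 2, ... doubles each element in place, with the final tile covering only the 40 elements left over after fifteen full tiles. With a hardware cache, steps 1 and 3 happen implicitly; without one, the programmer or the compiler has to choose the tile size and schedule the transfers.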
| Lecture Outline |
- Introduction
- Multicore and GPGPU architectures
- Issues in performance
- Explicitly managed memories
- Stream programming
- GPUs: Architectures and programming
- GPU architectures
- General-purpose computation on GPUs
- Programming models and idioms
- GPU programming models/environments:
  - OpenCL
  - CUDA
  - Brook+
  - RapidMind
- Code examples on GPUs
- Examples of CPU vs. GPU performance
- Compiler optimizations for multicore
- Brief review of data dependences, transformations
- Polyhedral models and tiling
- Locality and parallelism optimizations
- Reducing synchronization overheads
- Compiler optimizations for GPUs
- Performance characterization
- Optimizing memory accesses
- Multi-level parallelism exploitation
- Performance models and empirical search
- Examples of application optimization
- Software managed memory hierarchies
- Memory management for correctness and performance
- Scratchpad memory management
- Effect of bandwidth
- Explicit data movement
| Tutorial Speakers |
J. (Ram) Ramanujam received the B. Tech. degree in Electrical Engineering from the Indian Institute of Technology, Madras, India in 1983, and his M.S. and Ph. D. degrees in Computer Science from The Ohio State University in 1987 and 1990, respectively. He is currently the John E. and Beatrice L. Ritter Distinguished Professor in the Department of Electrical and Computer Engineering at Louisiana State University. His research interests are in compilers for high-performance computer systems, embedded systems, software optimizations for low-power computing, high-level hardware synthesis, and parallel architectures and algorithms. He has led several NSF-funded projects.
P. (Saday) Sadayappan received the B. Tech. degree from the Indian Institute of Technology, Madras, India, and M.S. and Ph. D. degrees from the State University of New York at Stony Brook, all in Electrical Engineering. He is currently a Professor in the Department of Computer Science and Engineering at The Ohio State University. His research interests include compile/runtime optimization, and scheduling and resource management for parallel/distributed systems. He has led several NSF-funded projects, including the Tensor Contraction Engine and the Pluto project.