ICS 2009 Tutorial: Programming Models and Compiler Optimizations for GPUs and Multi-Core Processors

Programming Models and Compiler Optimizations
for GPUs and Multi-Core Processors

Half-day Tutorial at

23rd International Conference on Supercomputing, ICS 2009
June 8-12, 2009
IBM T.J. Watson Research Center, Metro New York City Area, USA

J. (Ram) Ramanujam
Department of Electrical and Computer Engineering
and Center for Computation and Technology
Louisiana State University
Baton Rouge, LA 70803, USA

P. (Saday) Sadayappan
Department of Computer Science and Engineering
The Ohio State University
Columbus, OH 43210, USA

Audience

This tutorial is targeted primarily at application developers, computer/computational scientists, and graduate students interested in performance optimization issues and/or compilers for high-performance computing. Knowledge of C programming and basic knowledge of processor architectures will be assumed; no prior parallel programming experience is required.

Brief Description

On-chip parallelism with multiple cores is now ubiquitous. Because of power and cooling constraints, recent performance improvements in both general-purpose and special-purpose processors have come primarily from increased on-chip parallelism rather than increased clock rates. Parallelism is therefore of considerable interest to a much broader group than developers of parallel applications for high-end supercomputers. Several programming environments have recently emerged in response to the need to develop applications for GPUs, the Cell processor, and multi-core processors from AMD, IBM, Intel, and others. As commodity computing platforms all go parallel, programming these platforms to attain high performance has become an extremely important issue. There has been considerable recent interest in two complementary approaches:

developing programming models that explicitly expose parallelism to the programmer; and
developing compiler optimization frameworks that automatically transform sequential programs for parallel execution.

This tutorial will provide an introductory survey covering both of these aspects. Unlike conventional multicore architectures, GPUs and the Cell processor must exploit parallelism while also managing on-chip memory in software: since there are no caches, the program has to explicitly orchestrate the movement of data between large off-chip memories and the limited on-chip memory. This tutorial will address the issue of explicit memory management in detail.
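
As a rough illustration of what explicit data movement means, the following plain C sketch stages a large array through a small fixed-size buffer one tile at a time: copy in, compute, copy out. It is not taken from the tutorial material. The function name scale_tiled, the TILE size, and the use of memcpy are all assumptions made for illustration; on real hardware the copies would be done by the platform's own mechanism (for example DMA transfers on the Cell processor or loads into on-chip shared memory on a GPU), and the local array only stands in for that on-chip memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 64  /* hypothetical scratchpad capacity, in floats */

/* Scale every element of `data` (standing in for a large off-chip array)
 * by `factor`, staging one tile at a time through a small local buffer
 * that stands in for software-managed on-chip memory. */
void scale_tiled(float *data, size_t n, float factor)
{
    float local[TILE];  /* stand-in for the on-chip scratchpad */

    assert(data != NULL || n == 0);

    for (size_t base = 0; base < n; base += TILE) {
        /* The last tile may be shorter than TILE. */
        size_t len = (n - base < TILE) ? (n - base) : TILE;

        /* Copy in: "off-chip" array to "on-chip" buffer. */
        memcpy(local, data + base, len * sizeof(float));

        /* Compute using only the on-chip buffer. */
        for (size_t i = 0; i < len; i++)
            local[i] *= factor;

        /* Copy out: write the finished tile back. */
        memcpy(data + base, local, len * sizeof(float));
    }
}
```

A caller would simply invoke scale_tiled(a, n, 2.0f) on an array of n floats. The point of the sketch is that the tile size and the copy-in/copy-out steps are chosen by the programmer (or a compiler) rather than handled transparently by a hardware cache.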

Lecture Outline

Introduction

Multicore and GPGPU architectures
Issues in performance
Explicitly managed memories
Stream programming

GPUs: Architectures and programming

GPU architectures
General-purpose computation on GPUs
Programming models and idioms
GPU programming models/environments:

OpenCL
CUDA
Brook+
RapidMind

Code examples on GPUs
Examples of CPU vs. GPU performance

Compiler optimizations for multicore

Brief review of data dependences, transformations
Polyhedral models and tiling
Locality and parallelism optimizations
Reducing synchronization overheads

Compiler optimizations for GPUs

Performance characterization
Optimizing memory accesses
Multi-level parallelism exploitation
Performance models and empirical search
Examples of application optimization

Software managed memory hierarchies

Memory management for correctness and performance
Scratchpad memory management
Effect of bandwidth
Explicit data movement

Tutorial Speakers

J. (Ram) Ramanujam received the B. Tech. degree in Electrical Engineering from the Indian Institute of Technology, Madras, India in 1983, and his M.S. and Ph. D. degrees in Computer Science from The Ohio State University in 1987 and 1990 respectively. He is currently the John E. and Beatrice L. Ritter Distinguished Professor in the Department of Electrical and Computer Engineering at Louisiana State University. His research interests are in compilers for high-performance computer systems, embedded systems, software optimizations for low-power computing, high-level hardware synthesis, and parallel architectures and algorithms. He has led several NSF-funded projects.
P. (Saday) Sadayappan received the B. Tech. degree from the Indian Institute of Technology, Madras, India, and an M.S. and Ph. D. from the State University of New York at Stony Brook, all in Electrical Engineering. He is currently a Professor in the Department of Computer Science and Engineering at The Ohio State University. His research interests include compile/runtime optimization, and scheduling and resource management for parallel/distributed systems. He has led several NSF-funded projects, including the Tensor Contraction Engine and the Pluto project.
