PLUTO - An automatic parallelizer and
locality optimizer for multicores


PLUTO is an automatic parallelization tool that is based on the polyhedral model. The Polyhedral model is a geometrical representation for programs that utilizes machinery from Linear Algebra and Linear Programming for analysis and high-level transformations. PLUTO transforms C programs from source to source for coarse-grained parallelism and locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling and fusion, but not limited to it.

Current trends in microarchitecture are increasingly towards larger number of processing elements on a single chip. This has made parallelism and multi-core processors mainstream. The difficulty of programming these architectures to effectively tap the potential of multiple on-chip processing units is a significant challenge. Among several approaches to addressing this issue, one that is very promising but simultaneously very challenging is automatic parallelization. This requires no effort on part of the programmer in the process of parallelization and optimization and is therefore very attractive.

Many compute-intensive applications often spend most of their execution time in nested loops. This is particularly common in scientific and engineering applications. The polyhedral model provides a powerful abstraction to reason about transformations on such loop nests by viewing a dynamic instance (iteration) of each statement as an integer point in a well-defined space called the statement's polyhedron. With such a representation for each statement and a precise characterization of inter or intra-statement dependences, it is possible to reason about the correctness of complex loop transformations in a completely mathematical setting relying on machinery from linear algebra and linear programming. The transformations finally reflect in the generated code as reordered execution with improved cache locality and/or loops that have been parallelized. The polyhedral model is applicable to loop nests in which the data access functions and loop bounds are affine combinations (linear combination with a constant) of the enclosing loop variables and parameters.

Tiling is a key transformation in optimizing for parallelism and data locality. There has been a considerable amount of research into these two transformations. Tiling has been studied from two perspectives: data locality optimization and parallelization. Tiling for locality requires grouping points in an iteration space into smaller blocks (tiles) allowing reuse in multiple directions when the block fits in a faster memory (registers, L1, or L2 cache). Tiling for coarse-grained parallelism involves partitioning the iteration space into tiles that may be concurrently executed on different processors with a reduced frequency and volume of inter-processor communication: a tile is atomically executed on a processor with communication required only before and after execution. One of the key aspects of our transformation framework is to find good ways of performing tiling.

Finding good ways to tile for parallelism and locality directly through an affine transformation framework is the central idea in PLUTO. Our approach is thus a departure from scheduling-based approaches in this field as well as partitioning-based approaches (due to incorporation of more concrete optimization criteria). However, it is built on the same mathematical foundations and machinery. Model-driven empirical optimization and automatic tuning approaches (e.g., ATLAS) have been shown to be very effective in optimizing single-processor execution for some regular kernels like matrix-matrix multiplication. There is considerable interest in developing effective empirical tuning approaches for arbitrary input kernels. Our framework can enable such model-driven or guided empirical search to be applied to arbitrary affine programs, in the context of both sequential and parallel execution. Also, since our transformation system operates entirely in the polyhedral abstraction, it is not just limited to C or Fortran code, but could accept any high-level language from which polyhedral domains can be extracted and analyzed.

Papers

PLUTO is described in the following papers and technical reports:

Performance

Summary of results below are with Pluto git version 114e419014c6f14b3f193726e951b935ad120466 (25/05/2011) on an Intel Core i7 870 (2.93 GHz), 4 GB DDR3-1333 RAM running Linux 2.6.35 32-bit with ICC 12.0.3. Examples, problem sizes, etc. are in examples/ dir (or see git examples/ tree). Actual running times (in seconds) are here.

Pluto results summary

Old

All results below (old) were from a Intel Core2 Quad Q6600 on Linux x86-64 (2.6.18) with ICC 10.1 used to compile original and transformed/parallelized codes. These were from pluto-0.0.1.

Imperfect Jacobi stencil

Jacobi (single core) Jacobi (multiple cores)

2-d FDTD

LU LU

LU decomposition

LU LU

GEMVER, Doitgen

GEMVER Doitgen

DGEMM

LU LU

ATLAS was tuned with GCC 4.1. kSelMM perf numbers were approximated using the timing report produced after tuning ATLAS ('make time'). kSelMM, kGenMM perf on 2, 3 cores was interpolated (linearly) from perf on 1 and 4 cores.

For addditional details see the PLUTO website.

Funding

This work was supported in part by the National Science Foundation through awards 0121676, 0121706, 0403342, 0508245, 0509442, 0509467, 0541409, 0811457, and 0811781.

Acknowledgments

We would like to thank Cedric Bastoul (Paris-Sud XI University, Orsay, France) very much for CLooG. We wish to acknowledge Martin Griebl and team (FMI, Universitat Passau, Germany) for the LooPo infrastructure. In addition, we thank Alain Darte for useful feedback.