PrimeTile - A parametric multi-level tiler for imperfect loop nests


Loop tiling, as one of the most important compiler optimization techniques, is beneficial for both parallel machines and uniprocessors. Efficient generation of multi-level tiled code is essential to maximize data reuse in deep memory hierarchies. Tiled loops with parameterized tile sizes (not compile time constants) enable runtime optimizations used in iterative compilation and automatic tuning. Previous parametric multi-level tiling approaches have been restricted to perfectly nested loops, where all statements are contained inside the innermost loop of a loop nest. Previous solutions to tiling for imperfect loop nests have been limited to the case where tile sizes are fixed. PrimeTile provides an effective way to generate efficient parameterized multi-level tiled code for imperfectly nested loops. The generated tiled code contains loops that iterate over full rectangular tiles, making them suitable for compiler optimizations such as register tiling.

Tiling is a crucial transformation for achieving high performance, especially with deep multi-level memory hierarchies. It is a well known technique for improving data locality and register reuse. Tiling has received a lot of attention in the compiler community. However, the majority of work only addresses tiling of perfectly nested loops. With perfectly nested loops, tiling is possible when a band of loops is fully permutable. From a code viewpoint, tiling involves strip mining, i.e. splitting loops into an adjacent pair of tile and intra-tile loops, and permutation to bring all intra-tile loops innermost. The condition for permutability of a band of loops is that all correspondingly permuted dependence vectors must be lexicographically positive.

With imperfectly nested loops too, tiling involves strip mining loop splitting and permutation, along with loop distribution. The reasoning about legality of tiling of imperfectly nested loops requires a more general dependence model than dependence vectors. The earliest works to develop approaches to tile imperfectly nested loops effectively mapped the instances of statements of an imperfectly nested loop (of possibly different nesting depths in the input code) into a common embedding iteration space and then performed tiling in the framework of the common embedding space. A critical unresolved challenge with this approach is that of developing an effective algorithm for generating efficient tiled code where lower dimensional statements are hoisted out so that they do not have more loops surrounding them than their inherent dimensionality - heuristics have been proposed, but no generally effective implementation has been developed to our knowledge.

A major breakthrough in efficient code generation for multiple statements in imperfectly nested loops was achieved by the development of the polyhedral scanning algorithm of Quillere et al. and its refinement and optimization in CLooG. An innovative scheme for efficient parametric tiling of arbitrary polyhedral statement domains was developed by Renganarayana et al. and extended to efficiently generate multi-level parametric tiled code. However, this work is only applicable to perfect loop nests. The first effective approach for tiling of imperfectly nested loops was recently developed in the Pluto polyhedral transformation framework. However, it is only able to generate tiled code with fixed tile sizes. While the approach can handle multi-level tiling, the code generation complexity grows explosively; the latest distribution of Pluto was found to fail for some inputs when generating multi-level tiled code.

In this work on PrimeTile, we develop an effective approach to parametric multi-level tiling of imperfectly nested affine loops. The importance of parametric tiling is exemplified by the highly successful ATLAS system for empirical tuning of BLAS kernels. ATLAS uses parametrically tiled BLAS kernels that are repeatedly executed on the target architecture for different problem sizes using an empirical search strategy that varies the tile sizes. The ATLAS system was manually engineered by experts with insights into tiling for optimization of BLAS kernels. There has been much recent interest in developing generalized tuning systems that can similarly tune and optimize codes input by users or library developers. An efficient parametric tiling tool is extremely valuable for generating input tiled codes for such empirical tuning systems.

The key to our approach PrimeTile is the exploitation of the power and effectiveness of the Quillere algorithm in generating untiled imperfectly nested code, along with use of input scattering functions (affine schedules for the different statements) satisfying a generalized tiling condition and a geometric approach to generation of parametric tiles by post-processing the abstract syntax tree (AST) of the loop structure generated by CLooG.

Papers

  1. Parametric Multi-Level Tiling of Imperfectly Nested Loops
    Albert Hartono, Muthu Manikandan Baskaran, Cédric Bastoul, Albert Cohen, Sriram Krishnamoorthy, Boyana Norris, J. Ramanujam, and P. Sadayappan.
    ACM International Conference on Supercomputing (ICS), June 2009, New York, New York. (accepted conditionally with shepherding)

  2. PrimeTile: A Parametric Multi-Level Tiler for Imperfect Loop Nests [ PDF ]
    Albert Hartono, Muthu Manikandan Baskaran, Cédric Bastoul, Albert Cohen, Sriram Krishnamoorthy, Boyana Norris, J. Ramanujam, and P. Sadayappan.
    Research Report OSU-CISRC-2/09-TR04, Computer Science and Engineering Department, The Ohio State University, Feb 2009.

Funding

This work was supported in part by the National Science Foundation through awards 0508245, 0509442, 0509467, 0541409, 0811457, and 0811781.

Related Work

Pluto -- a state-of-the-art polyhedral-based parallelizer and locality optimizer for multicore processors
CLooG -- a loop code generator for scanning Z-polyhedra
HiTLOG -- a tiled loop generator for perfectly nested loops