J. Ramanujam
Department of Electrical and Computer Engineering
Louisiana State University, Baton Rouge, LA 70803-5901
Email: jxr@ee.lsu.edu
This paper presents an approach to modeling loop transformations using linear algebra. Compound transformations are modeled as integer matrices. Non-singular linear transformations presented here subsumes the class of unimodular transformations. The loop transformations included are the unimodular transformations -- reversal, skewing, permutation -- and a new transformation, namely stretching. Non-unimodular transformations (with determinant ) create ``holes'' in the transformed iteration space, rendering code generation difficult. We solve this problem by suitably changing the step size of loops in order to ``skip'' these holes when traversing the transformed iteration space. For the class of non-unimodular loop transformations, we present algorithms for deriving the loop bounds, the array access expressions and step sizes of loops in the nest. To derive the step sizes, we compute the Hermite Normal Form of the transformation matrix; the step sizes are the entries on the diagonal of this matrix. We then use the theory of Hessenberg matrices in the derivation of exact loop bounds for non-unimodular transformations. We illustrate the use of this approach in several problems such as generation of tile sets, and distributed memory code generation. This approach provides a framework for optimizing programs for a variety of architectures.
Loop transformations are widely used by automatic parallelizers to generate efficient code for a variety of high performance computers [1,4,20,24]. The interactions among loop transformations is of particular importance for the success of parallelizing compilers. The order of applying a set of transformations has a great impact on the performance of the loop on a given target parallel machine. In addition, it is becoming increasingly common for programmers of highly concurrent machines to specify parallelism using language constructs. Hence, compound loop transformations have received great attention recently.
Banerjee [4], Dowling [6], Wolf and Lam [19] and this author [14] have derived unimodular transformations that model compound loop transformations using integer matrices with unit determinant. This paper presents an approach for restructuring nested loops using general non-singular linear transformations, the matrices associated with which need not have unit determinants. Non-singular linear transformations presented here subsumes the class of unimodular transformations. Non-unimodular transformations (with determinant ) create ``holes'' in the transformed iteration space. We change the step size of loops in order to ``step aside from these holes'' when traversing the transformed iteration space. For the class of non-unimodular loop transformations, we present algorithms for deriving the loop bounds, the array access expressions and step sizes of loops in the nest. In order to derive the step sizes, we compute the Hermite Normal Form of the transformation matrix; the step sizes are the entries on the diagonal of this matrix. In addition to these advantages, the approach provides a controlled search of the space of possible loop optimizations. We illustrate the use of this approach in several problems such as generation of tile sets, distributed memory code generation and dependence analysis.
Section 2 presents the terminology and assumptions used in this paper. Section 3 discusses unimodular loop transformations and introduces loop stretching; in addition, we argue that unimodularity is not sufficient for optimizing communication with tiling for distributed memory machines. Issues in code generation for unimodular and non-unimodular transformations are discussed in Section 4. Section 5 presents solutions to the issues of loop step size (based on the Hermite Normal Form of the transformation matrix) and exact loop bound generation (based on Hessenberg matrices), in addition to an example illustrating the use of our techniques. In Section 6, we discuss applications of the Hermite Normal From to such applications as tile set generation, and distributed memory code generation. Section 7 presents a discussion of related work. Section 8 summarizes the paper and points to other applications and work in progress.
A data dependence between two references arises from their reading or writing a common memory location. This dependence is called an input dependence if both the references are reads; input dependences do not constrain program execution ordering. If at least one of the two references is a write, the dependence can be a flow, anti or output dependence. Let and be two references to a memory location, and let be a write and be the read. If precedes in sequential execution, then the dependence is called a flow dependence. If precedes in sequential execution, then the dependence is called an anti dependence. If both the accesses to the memory location are writes, we have an output dependence.
Data dependences are characterized by the dependence distance, which is the number of iterations that separate the sink of the dependence from its source. Dependence directions which represent the sign of the dependence distance are also used to characterize dependences. Dependence distance and direction are represented as vectors called dependence vectors. The elements of the vector read from left to right indicate the dependence from the outermost to the innermost loop. We will use dependence vectors as columns vectors. A dependence matrix has as its columns, a set of dependence vectors. All dependence vectors are positive vectors, i.e., the first non-zero component in the vector (read left to right) must be positive. Any loop transformation is legal if after transformation, the dependence vectors remain positive vectors. A detailed discussion of dependences including dependence testing can be found in [1,3,20,23,24].
The set of computations in a perfectly nested loop is called the computation domain. For the purpose of this paper, only the structural information of the nested loop, i.e., the computation domain and the distance matrix D are needed. With the assumption on loop bound linearity, the sets of computations considered are finite convex polyhedra of some iteration space in , where n is the number of loops in the nest which is also the dimensionality of the iteration space. Each element, an iteration of the loop body S, is referred to by its iteration vector .
The iteration set of a given nested loop is described as a set of integer points (vectors) whose convex hull is a non-degenerate (full dimensional) polyhedron, i.e.,
where a is a positive integer and
Matrix P is called the constraint matrix of J; vector is called the size vector. The above relation simply states that is a polyhedron. Extreme points of polyhedron are always integers and belong to the iteration space J because is the convex hull of iteration space J. In addition, extreme points remain extreme points under rank-preserving linear transformations. Different size vectors correspond to instances of the same algorithm that differ only in their sizes (but have the same shape of the computation domain). The following example illustrates the concepts introduced here.
The computation domain is given by
For convenience, we represent the computation domain as the matrix
which is just the P matrix augmented with the size vector .
: Loop transformations: interchange, reversal and
skewing
A distance matrix is a collection of dependence distance vectors for a nested loop -- the distance vectors are the columns of the distance matrix. For an n-nested loop with r distance vectors , the distance matrix D is an matrix. Let be the loop index variables for the loop nest levels. The vector is referred to as the nest vector.
Loop transformations are applied to the loop nest vector and the dependence matrix. Let and be two iterations of a nested loop such that (read precedes in sequential execution) and there is a dependence of constant distance between them. Applying a linear transformation T to the iteration space (nest vector) also changes the distance matrix since . All dependence vectors are positive vectors, i.e., the first non-zero component should be positive. We do not include loop-independent dependences which are zero vectors. A transformation is legal if the resulting dependence vectors are still positive vectors [3]. Note that the legality definition applies to bounded and unbounded sets of dependence vectors; constant distance vectors are bounded dependence vectors. Thus, the linear algebraic view leads to a simpler notion of the legality of a transformation. For an n-nested sequential loop, the identity matrix () denotes sequential execution order. Loop transformations correspond to elementary row operations on matrices [4].
: Effect of unimodularity of communication
Loop stretching is a transformation that corresponds to changing the step size of a loop by an integer factor f. Stretching is always legal since it does not alter the order of execution. Loop stretching is the inverse of loop step normalization [20,24]. Multiplying a row by f where f > 1 corresponds to stretching. For example, the transformation,
changes the step size of the inner loop to 2. Figure 2 illustrates the effect of T on a iteration space. Note that within the boundary of the transformed iteration space, there are points which are skipped over; we refer to these points as ``holes.'' These are the points in the transformed iteration space which are not images of any point in the original iteration space, i.e., these are vectors for which is not integral.
Any nonsingular integer linear transformation can be realized through combinations of reversal, stretching, interchange and skewing -- this follows from the fact that a series of row operations applied to the identity matrix can generate any integer matrix [18].
In this section, we present an example to show the need for non-unimodular transformations. Iteration space tiling [9,13,16,19,20,21] is a transformation that decomposes an n-dimensional iteration space into n-dimensional ``blocks'' composed of a number of iterations where each block executes atomically; hence there should be no dependence cycles (deadlock) among the blocks. Tiling is extremely useful in distributed memory machines and hierarchical shared memory machines in exploiting locality of reference in the presence of dependences [13,16,17]. In [16], we modeled the problem of finding the best tiling transformation for distributed memory machines as (a linear programming problem) that of finding a matrix T such that the entries in are non-negative (so that there are no dependence cycles) and the sum of the entries in is a minimum; the sum of the entries in corresponds to the amount of communication (flux or volume) required. In 2-dimensional iteration spaces, the best transformation matrix is not necessarily unimodular [13]. For example, consider the distance matrix
The tiling transformation T with the least communication is
whose determinant is 2. From a transformational view, we need to find a non-singular matrix T
such that
has nonnegative entries. The optimal solutions for linear programming problems occur at the simplex vertices or corners of the feasible region. The corners of the feasible region for the above set of constraints are: and . Since we require that T is non-singular, its determinant must be nonzero i.e., , the second simplex point cannot be considered. Therefore, the solution is . In general, we need to consider only those simplex vertices where the matrix T would be non-singular. Among unimodular matrices of the form
a=1 gives the minimum communication volume. If we look for non-singular matrices of the form
and gives the smallest communication volume. Figure 3 illustrates the effect of unimodularity on communication. In Figure 3, we do not show the communication arising due to the dependence vector . The tile shape in Figure 3(a) is generated by the non-unimodular matrix
and the tile shape shape in Figure 3(b) is generated by the unimodular matrix
We note that T in this case has a determinant 2 and thus can not be generated by unimodular transformations. The reader is referred to [13] for details.
Assume that a unimodular transformation T is applied to a loop nest. The issues in code generation for T are:
Let denote the nest vector for the original iteration space and denote the transformed iteration space nest vector, i.e., is the image of under T. We assume that the array index expressions are affine functions of the loop indices i.e., of the form where is an matrix (assuming that the array being referenced is of dimension m) and the vector is an m-element vector (referred to as the offset vector). We want the same array elements accessed in and . Therefore we need accessed in as well. In order to do this, we change to , since .
Specifying a loop involves writing the values of the upper and lower limits of each loop index with the constraint that the limit of a loop index is an affine function of the values of loop indices where are the transformed loop indices. The computation domain is:
Let T be a non-singular transformation applied to the loop nest, then
In order to find the new loop bounds:
: Effect of non-unimodular transformations on
example 1.
In this section, we contrast code generation for non-unimodular transformations with that for unimodular transformation. We use the following loop skeleton to illustrate the application of our techniques:
Consider the following non-unimodular transformation (with determinant 2):
Figure 4 shows the original iteration space and the transformed iteration space for the loop in example 1 under T.
If we generate code in this case as if T were a unimodular transformation, we generate the following incorrect code:
The code generated above traverses all points within the boundary of the transformed iteration space in Figure 4(b). In order for the code to be correct, the step size of the inner loop should be 2 and the bounds of the inner loop have to be corrected. We note here that computation of array index expressions for the transformed nest is the same as with unimodular transformations; therefore, we will not discuss it any further.
The points in the transformed iteration space in Figure 4(b) that have to be skipped (holes) are those for which there is no corresponding point in the original iteration space, i.e., points in the transformed iteration space for which is not integral are the holes. This suggests the following simple code generation strategy -- in the body of the code generated as if T were unimodular, add a guard for each iteration as shown below:
This incurs expensive runtime overhead. In the next section, we show how to generate code which avoids expensive runtime overhead. Thus, the remaining issues in code generation (discussed next) are:
We generalize unimodular transformations described by [4,19] to derive non-singular linear transformations defined by integer matrices whose determinant is not restricted to . This section presents details of code generation for non-unimodular loop transformations. Iteration spaces defined by nested loops form bounded integer lattices. The theory presented here is based on the fact that the lattice generated by the columns of a matrix T is the same as the lattice generated by the Hermite Normal Form (defined below) of T. The entries on the diagonal of the Hermite Normal Form of T give the loop step sizes. In the case of n-nested loops with step size 1, the iteration space is a lattice generated by the columns of the identity matrix [14], since any point in the iteration space can be written as integral linear combinations of the columns of . If the step size along loop level j is , then the corresponding iteration space is generated by a matrix which differs from in that the entry .
An matrix C is unimodular if it is integral and its determinant is . The set
where T is an matrix is called the lattice generated by the columns of T.
Proof: Let x=Cz. Then we have
Since C is an integer matrix, implies . Since C is unimodular, is also unimodular and integral; hence implies that or . Thus, it follows that . Therefore, .
An non-singular matrix H is said to be in Hermite Normal Form if:
The right multiplication of T by a unimodular matrix C corresponds to a sequence of elementary column operations defined below:
To compute the step sizes, we first compute the Hermite Normal Form, H of the transformation matrix, T. Since the columns of H and T generate the same lattice, the entries on the diagonal of H are the step sizes of the loops in the nest.
In this section, we discuss how to correct the loop bounds generated by using the procedure in Section 4.1.2. For this, we need closed form expressions for the solution of lower triangular matrices. Fortunately, the theory of lower Hessenberg matrices [7] comes to our rescue and allows us to derive closed form solutions for inverting lower triangular matrices. The reader may note that, while the theory of Hessenberg matrices is actually an application of Cramer's rule and determinant expansions, the derived closed form solutions are quite useful in a compiler that implements the transformations.
Every valid point in the transformed iteration space must be the image of a point in the original iteration space. This means that must be integral. Since the columns of the matrices T and H generate the same lattice, it is sufficient to check that is integral for every point in the transformed iteration space. These checks are expensive since they add to runtime overhead. Given that the step sizes are correct, if we hit the right starting point for each loop, then we generate all and only the valid points in the transformed iteration space.
Consider the problem of solving the linear system , where H is lower triangular. The problem and the solution () are shown below for a matrix, H.
To derive closed form expressions for , we use determinants of the Hessenberg matrix M associated with H defined as follows:
Notice that M is obtained from H by dividing each column in H by the diagonal element on that column, then changing the sign of each element of H, dropping its last column, and adding the vector as the new first column. Let refer to the sub-matrix of M formed by rows i thru j and columns i thru j where . For example,
Let be the determinant of . Thus,
In general,
and
Thus, using lower Hessenberg matrices, we have closed form expressions for .
Let the nest vector for the original iteration space be and be the nest vector for the transformed iteration space under T. Let and be the lower and upper bounds of the transformed loops as calculated using the procedure given in Section 4.1.2. Since floors and ceilings are used in computing respectively the lower and upper bounds, and are accurate boundaries of the transformed iteration space if T were unimodular; where as the closest integer point computed using the steps in Section 4.1.2 need not be the image of a point in the source iteration space. While no point outside the boundary computed as above belongs to the transformed iteration space, there may be points inside which do not belong. In the following discussion, we use and () to refer to the exact bounds.
To generate only the valid points in the transformed iteration space, the loop body of the transformed nest must be executed only for for which is integral. From the above, results, it follows that for to be integral, for . We note here the bounds of the outermost loop are always exact, since the image of a corner of a polyhedron is a corner under invertible linear transformations. We apply the correction to the loop bounds for loops . The basic idea behind the correction is as follows: if is not exact, then there exists an integer such that and is exact. We need to find without searching for it. The theory of Hessenberg matrices comes to our rescue. We use the following modified form of the Hessenberg matrix:
The only difference between used here and the matrix M in Section 5.2.1 is that, we use (the loop bound calculated by the procedure in Section 4.1.2 for loop variable ) instead of . Let be the determinant of the matrix . and are similarly defined (using in the place of ). Let denote the expression and denote the expression . If is exact, then . If is not exact, ; hence, . Similarly, if is not exact, then the correct bound is . Thus, the exact lower bound is given by
The exact upper bound is
The exact bounds derived here obviate the need for expensive runtime checks used in [2,5,11].
We refer to the loop in Example 1; for convenience, we reproduce the example below:
Consider the following non-unimodular transformation
In order to restructure the loop nest, we first find the HNF of T. The matrix H (the HNF of T) is derived through the following column operation:
Therefore, in the transformed loop nest, the step size of the outer loop is 1 and the inner loop is 2. The matrix is:
The computation domain (P matrix augmented with vector ) is:
In order to compute the loop bounds, we first compute :
which can be scaled to become:
If we replace row 1 by the sum of rows 1 and 3 and if we replace row 2 by the sum of rows 2 and 4, we get the bounds of the outer loop from
Thus, the new outer loop limits are 2 and 2N and the step size is 1. Similarly, we can derive the loop limits of the inner loop whose step size is 2. The (incorrect) inner loop bounds are:
In all cases, the lower bound as calculated is 0. The and matrices are:
Thus, and . Therefore, the corrected loop bounds are:
Since , we can write as . The correct transformed code:
We note here that the correction factor applied to loop bounds follows a regular pattern. In ongoing work [15], we have characterized the sequence of applied correction factors as a finite-state machine that depends on the computed loop step size and the shape of the source iteration space. The use of the finite-state machine reduces runtime overhead.
In tiling a nest of loops, the set of tiles (referred to as the tile set) is defined as a collection of origins of the tiles. An origin of a tile, is the lexicographically earliest (smallest) iteration that belongs to the tile. Hence, the origins of tiles in n-dimensional iteration spaces are generated by a matrix whose columns form the basis of the lattice. In many cases, the lattice of tile origins are defined by a user through an input matrix or automatically generated by an optimizing compiler. We use the same terminology as [2] here. Let L be an matrix specifying the lattice of tile origins for a loop nest with a computation domain given by . We show how to generate the origins of the tile sets.
We use the following loop nest (from [2]):
for which tile origins are given by the lattice generated by the columns of L, where L is:
The problem of enumerating tile origins is equivalent to applying the transformation denoted by the matrix :
The computation domain (P matrix augmented with vector ) is:
In order to compute the loop bounds, we first compute PL (since ).
If we replace row 1 by the sum of rows 1 and 3, we get the bounds of the outer loop as and . Similarly, we can derive the loop bounds of the inner loop. The loop that traverses the origins of non-empty tiles is:
If we shift the outer loop to 0, we have:
The points in the transformed iteration space are the origins of non-empty tiles.
Current efforts in distributed memory code generation start with a user-defined decomposition of data structures [8]. Extensions to a serial language allow users to specify the decomposition of arrays. A processor is said to own elements of data allocated to it. The Fortran-D research group at Rice notes: ``We find that selecting a data decomposition is one of the most important intellectual steps in developing data parallel scientific codes'' [8]. The compiler automatically partitions the program using the owner compute rule by which each processor computes values of data it owns. The data owned by a processor is known as its local index set and the set of loop iterations that compute owned data is referred to as the local iteration set [8].
Non-singular transformations may be needed in deriving the local iteration sets; we compute the owner loop bounds so that processors avoid executing expensive guards looking for work to do. Consider the loop shown below:
Assume for this example that the locally owned part of B in some processor is given by . The subscript functions can be represented in matrix form as , or
where corresponds to a point in the data space and corresponds to a point in the iteration space. In order to enforce the owner compute rule, and . That is,
Recall that if the access function is , on applying a transformation T to the loop, the access function becomes where is a vector of the transformed loop indices. Therefore if a transformation T = A is legal, the access function becomes ; this renders distributed memory code generation easy. For Example 3, to generate distributed memory code for this example, we first apply the transformation T which is the same as the data access function represented as a matrix:
The Hermite Normal Form of T is
The output generated by our algorithm for T applied to the loop in Example 3 is:
In order to ensure that only local data is accessed, we need to include those in the bounds. We get:
In the following example, we assume that the transformation A (the matrix part of the access function) is not legal and show how to generate code by just adjusting loop bounds. Consider the loop shown below:
Assume for this example that the locally owned part of D in some processor is given by . The subscript functions can be represented in matrix form as , or
where corresponds to a point in the data space and corresponds to a point in the iteration space. In order to enforce the owner compute rule, and . That is,
which is equivalent to the following system of inequalities:
We reduce the above system to row echelon form that results in the following local iteration set for the processor avoiding expensive guards:
The code generation technique is also applicable for cyclic and block-cyclic forms of data decomposition as well. The Hermite Normal Form is also useful in solving systems of dependence equations which are systems of linear Diophantine equations. It can be used to check if there are no integer solutions; if there are integer solutions, HNF can be used to find a general form of the solutions.
Dowling [6] presents an integer programming solution to the problem of optimal code parallelization using unimodular matrices that transform a given nested loop into parallel form. Banerjee [4] and Wolf and Lam [19] present unimodular transformations of loop nests and algorithms to derive transformations for maximal parallelism. Barnett and Lengauer [5] conclude that unimodularity is not essential; to derive code for non-unimodular transformations, they generate piecewise loop bounds which are not exact. Ancourt and Irigoin [2] and Lu [11] use conditionals in their transformed code to ensure correctness of the transformation. The techniques in this paper obviate the need for such conditionals. The techniques presented in [2,11,5] result in runtime checks. Wolfe [22] presents a technique that searches through a search space comprising both linear and non-linear transformations of the iteration space. In contrast to these approaches, our work presents a theory of non-singular linear transformations of the iteration space characterized by integer matrices whose determinant is an arbitrary integer. We show an example of the need for non-unimodular transformations; the transformations are derived through modifying the step size of the loops. Li and Pingali [10] have independently derived a technique for code generation for general non-singular loop transformations. They present a completion algorithm which given a partial transformation generates a full non-singular transformation matrix that preserves dependences in the loop nest. Their solution decomposes a compound transformation represented by a non-singular matrix into a unimodular transformation followed by a non-unimodular transformation; the second transformation does not change the relative order of execution of iterations. The loop bounds generated involve complex integer operations. In ongoing work [15], we have characterized the sequence of correction factors as a finite-state machine. This reduces the overhead arising from expensive integer operations in loop bound calculations. In addition, their technique does not directly work for applications such as distributed memory code generation.
Compound loop transformations modeled as linear transformations have received great attention recently. In this paper, we presented a linear algebraic approach to modeling loop transformations. We presented techniques for code generation for non-unimodular loop transformations. Unimodular transformations [4,19] form a subclass of non-singular linear transformations presented here. Non-unimodular transformations (with determinant ) create ``holes'' in the transformed iteration space. We change the step size of loops in order to ``skip'' these holes when traversing the transformed iteration space. For the class of non-unimodular loop transformations, we presented algorithms for deriving the loop bounds, the array access expressions and step sizes of loops in the nest. To derive the step sizes, we compute the Hermite Normal Form of the transformation matrix; the step sizes are the entries on the diagonal of this matrix. We then illustrated the use these techniques in important applications such as generating tile sets, and deriving local iteration sets in distributed memory code generation. The Hermite Normal Form is also useful in dependence analysis. The technique presented here holds promise in creating a framework for optimizing programs for a wide variety of architectures.
We thank the anonymous referees for their comments on an earlier draft of the paper.
Beyond Unimodular Transformations
This document was generated using the LaTeX2HTML translator Version 95 (Thu Jan 19 1995) Copyright © 1993, 1994, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -t TJS paper -split 0 -no_navigation revised.tex.
The translation was initiated by J. Ramanujam on Fri Mar 3 15:03:11 CST 1995