Although the approach taken in the preceding section should be sufficient for a basic understanding of the topic, we have to work harder to describe and judge the properties of systolic arrays in a quantitative and precise way. In particular the solution of parametric problems requires a solid mathematical framework. So, in this section, we study central concepts of a formal theory on uniform algorithms, based on linear transformations.
System (16.3) can be computed by a multitude of other systolic arrays, besides that from Figure 16.1. In Figure 16.3, for example, we see such an alternative systolic array. Whereas the same function is evaluated by both architectures, the appearance of the array from Figure 16.3 is very different:
The number of cells now is considerably larger, altogether 36, instead of 15.
The shape of the array is hexagonal, instead of rectangular.
Each cell now has three input ports and three output ports.
The input scheme is clearly different from that of Figure 16.1(a).
And finally: the matrix here also flows through the whole array.
The cell structure from Figure 16.3(b) at first view does not appear essentially distinguished from that in Figure 16.1(b). But the differences matter: there are no cyclic paths in the new cell, thus stationary variables can no longer appear. Instead, the cell is provided with three input ports and three output ports, passing items of all three matrices through the cell. The direction of communication at the ports on the right and left borders of the cell has changed, as well as the assignment of the matrices to the ports.
Figure 16.3. Hexagonal systolic array for matrix product. (a) Array structure and principle of the data input/output. (b) Cell structure.
How system (16.3) is related to Figure 16.3? No doubt that you were able to fully understand the operation of the systolic array from Section 16.1 without any special aid. But for the present example this is considerably more difficult—so now you may be sufficiently motivated for the use of a mathematical formalism.
We can assign two fundamental measures to each elementary operation of an algorithm for describing the execution in the systolic array: the time when the operation is performed, and the position of the cell where the operation is performed. As will become clear in the sequel, after fixing the so-called space-time transformation there are hardly any degrees of freedom left for further design: practically all features of the intended systolic array strictly follow from the chosen space-time transformation.
by the iteration vector
so in this case
The space coordinates of the executed operations in the example from Figure 16.1 can be inferred as from the iteration vector according to our decision in Subsection 16.1.3. The chosen map is a projection of the space along the axis. This linear map can be described by a projection matrix
To find the space coordinates, we multiply the projection matrix by the iteration vector , written as
The projection direction can be represented by any vector perpendicular to all rows of the projection matrix,
For the projection matrix from (16.8), one of the possible projection vectors would be .
Projections are very popular for describing the space coordinates when designing a systolic array. Also in our example from Figure 16.3(a), the space coordinates are generated by projecting the iteration vector. Here, a feasible projection matrix is given by
A corresponding projection vector would be .
We can combine the projection matrix and the time vector in a matrix , that fully describes the space-time transformation,
The first and second rows of are constituted by the projection matrix , the third row by the time vector .
For the example from Figure 16.1, the matrix giving the space-time transformation reads as
for the example from Figure 16.3 we have
Space-time transformations may be understood as a global view to the systolic system. Applying a space-time transformation—that is linear, here, and described by a matrix —to a system of recurrence equations directly yields the external features of the systolic array, i.e., its architecture—consisting of space coordinates, connection pattern, and cell structure.
Remark. Instead of purely linear maps, we alternatively may consider general affine maps, additionally providing a translative component, . Though as long as we treat all iteration vectors with a common space-time transformation, affine maps are not really required.
If the domains are numerically given and contain few points in particular, we can easily calculate the concrete set of space coordinates via equation (16.9). But when the domains are specified parametrically as in system (16.3), the positions of the cells must be determined by symbolic evaluation. The following explanation especially dwells on this problem.
Suppose that each cell of the systolic array is represented geometrically by a point with space coordinates in the two-dimensional space . From each iteration vector of the domain , by equation (16.9) we get the space coordinates of a certain processor, : the operations denoted by are projected onto cell . The set of space coordinates states the positions of all cells in the systolic array necessary for correct operation.
To our advantage, we normally use domains that can be described as the set of all integer points inside a convex region, here a subset of —called dense convex domains. The convex hull of such a domain with a finite number of domain points is a polytope, with domain points as vertices. Polytopes map to polytopes again by arbitrary linear transformations. Now we can make use of the fact that each projection is a linear transformation. Vertices of the destination polytope then are images of vertices of the source polytope.
Remark. But not all vertices of a source polytope need to be projected to vertices of the destination polytope, see for instance Figure 16.4.
Figure 16.4. Image of a rectangular domain under projection. Most interior points have been suppressed for clarity. Images of previous vertex points are shaded.
When projected by an integer matrix , the lattice maps to the lattice if can be extended by an integer time vector to a unimodular space-time matrix . Practically any dense convex domain, apart from some exceptions irrelevant to usual applications, thereby maps to another dense convex set of space coordinates, that is completely characterised by the vertices of the hull polytope. To determine the shape and the size of the systolic array, it is therefore sufficient to apply the matrix to the vertices of the convex hull of .
Remark. Any square integer matrix with determinant is called unimodular. Unimodular matrices have unimodular inverses.
We apply this method to the integer domain
from system (16.3). The vertices of the convex hull here are
For the projection matrix from (16.11), the vertices of the corresponding image have the positions
Since has eight vertices, but the image only six, it is obvious that two vertices of have become interior points of the image, and thus are of no relevance for the size of the array; namely the vertices and . This phenomenon is sketched in Figure 16.4.
The settings , , and yield the vertices (3,0), (3,-2), (0,-2), (-4,2), (-4,4), and (-1,4). We see that space coordinates in principle can be negative. Moreover, the choice of an origin—that here lies in the interior of the polytope—might not always be obvious.
As the image of the projection, we get a systolic array with hexagonal shape and parallel opposite borders. On these, we find , , and integer points, respectively; cf. Figure 16.5. Thus, as opposed to our first example, all problem parameters here are also array parameters.
The area function of this region is of order , and thus depends on all three matrix dimensions. So this is quite different from the situation in Figure 16.1(a), where the area function—for the same problem—is of order .
Improving on this approximate calculation, we finally count the exact number of cells. For this process, it might be helpful to partition the entire region into subregions for which the number of cells comprised can be easily determined; see Figure 16.5. The points (0,0), , , and are the vertices of a rectangle with cells. If we translate this point set up by cells and right by cells, we exactly cover the whole region. Each shift by one cell up and right contributes just another cells. Altogether this yields cells.
For , , and we thereby get a number of 36 cells, as we have already learned from Figure 16.3(a).
The running time of a systolic algorithm can be symbolically calculated by an approach similar to that in Subsection 16.2.3. The time transformation according to formula (16.6) as well is a linear map. We find the timesteps of the first and the last calculations as the minimum resp. maximum in the set of execution timesteps. Following the discussion above, it thereby suffices to vary over the vertices of the convex hull of .
The running time is then given by the formula
Adding one is mandatory here, since the first as well as the last timestep belong to the calculation.
With the basic assumption , we get a minimum of 3 and a maximum of , thus a running time of timesteps, as for the systolic array from Figure 16.1—no surprise, since the domains and the time vectors agree.
For the special problem parameters , , and , a running time of timesteps can be derived.
If , the systolic algorithm shows a running time of order , using systolic cells.
The communication topology of the systolic array is induced by applying the space-time transformation to the data dependences of the algorithm. Each data dependence results from a direct use of a variable instance to calculate another instance of the same variable, or an instance of another variable.
Remark. In contrast to the general situation where a data dependence analysis for imperative programming languages has to be performed by highly optimising compilers, data dependences here always are flow dependences. This is a direct consequence from the assignment-free notation employed by us.
The data dependences can be read off the quantified equations in our assignment-free notation by comparing their right and left sides. For example, we first analyse the equation from system (16.3).
The value is calculated from the values , , and . Thus we have a data flow from to , a data flow from to , and a data flow from to .
All properties of such a data flow that matter here can be covered by a dependence vector, which is the iteration vector of the calculated variable instance minus the iteration vector of the correspondingly used variable instance.
The iteration vector for is ; that for is . Thus, as the difference vector, we find
Correspondingly, we get
In the equation from system (16.3), we cannot directly recognise which is the calculated variable instance, and which is the used variable instance. This example elucidates the difference between equations and assignments. When fixing that should follow from by a copy operation, we get the same dependence vector as in (16.20). Correspondingly for the equation .
A variable instance with iteration vector is calculated in cell . If for this calculation another variable instance with iteration vector is needed, implying a data dependence with dependence vector , the used variable instance is provided by cell . Therefore, we need a communication from cell to cell . In systolic arrays, all communication has to be via direct static links between the communicating cells. Due to the linearity of the transformation from (16.9), we have .
If , communication happens exclusively inside the calculating cell, i.e., in time, only—and not in space. Passing values in time is via registers of the calculating cell.
Whereas for , a communication between different cells is needed. Then a link along the flow direction must be provided from/to all cells of the systolic array. The vector , oriented in counter flow direction, leads from space point to space point .
If there is more than one dependence vector , we need an appropriate link for each of them at every cell. Take for example the formulas (16.19), (16.20), and (16.21) together with (16.11), then we get , , and . In Figure 16.3(a), terminating at every cell, we see three links corresponding to the various vectors . This results in a hexagonal communication topology—instead of the orthogonal communication topology from the first example.
Now we apply the space-related techniques from Subsection 16.2.5 to time-related questions. A variable instance with iteration vector is calculated in timestep . If this calculation uses another variable instance with iteration vector , the former had been calculated in timestep . Hence communication corresponding to the dependence vector must take exactly timesteps.
Since (16.6) describes a linear map, we have . According to the systolic principle, each communication must involve at least one register. The dependence vectors are fixed, and so the choice of a time vector is constrained by
In case , we must provide registers for stationary variables in all cells. But each register is overwritten with a new value in every timestep. Hence, if , the old value must be carried on to a further register. Since this is repeated for timesteps, the cell needs exactly registers per stationary variable. The values of the stationary variable successively pass all these registers before eventually being used. If , the transport of values analogously goes by registers, though these are not required to belong all to the same cell.
For each dependence vector , we thus need an appropriate number of registers. In Figure 16.3(b), we see three input ports at the cell, corresponding to the dependence vectors , , and . Since for these we have . Moreover, due to (16.7) and (16.4). Thus, we need one register per dependence vector. Finally, the regularity of system (16.3) forces three output ports for every cell, opposite to the corresponding input ports.
Good news: we can infer in general that each cell needs only a few registers, because the number of dependence vectors is statically bounded with a system like (16.3), and for each of the dependence vectors the amount of registers has a fixed and usually small value.
The three input and output ports at every cell now permit the use of three moving matrices. Very differently from Figure 16.1, a dot product here is not calculated within a single cell, but dispersed over the systolic array. As a prerequisite, we had to dissolve the sum into a sequence of single additions. We call this principle a distributed generic operator.
Apart from the three input ports with their registers, and the three output ports, Figure 16.3(b) shows a multiplier chained to an adder. Both units are induced in each cell by applying the transformation (16.9) to the domain of the equation from system (16.3). According to this equation, the addition has to follow the calculation of the product, so the order of the hardware operators as seen in Figure 16.3(b) is implied.
The source cell for each of the used operands follows from the projection of the corresponding dependence vector. Here, variable is related to the dependence vector . The projection constitutes the flow direction of matrix . Thus the value to be used has to be expected, as observed by the calculating cell, in opposite direction , in this case from the port in the lower left corner of the cell, passing through register A. All the same, comes from the right via register B, and from above through register C. The calculated values , , and are output into the opposite directions through the appropriate ports: to the upper right, to the left, and downwards.
If alternatively we use the projection matrix from (16.8), then for we get the direction . The formula results in the requirement of exactly one register C for each item of the matrix . This register provides the value for the calculation of , and after this calculation receives the value . All this reasoning matches with the cell from Figure 16.1(b). Figure 16.1(a) correspondingly shows no links for matrix between the cells: for the matrix is stationary.
16.2-1 Each projection vector induces several corresponding projection matrices .
a. Show that
also is a projection matrix fitting with projection vector .
b. Use this projection matrix to transform the domain from system (16.3).
c. The resulting space coordinates differ from that in Subsection 16.2.3. Why, in spite of this, both point sets are topologically equivalent?
d. Analyse the cells in both arrangements for common and differing features.