In Figure 16.3(a), the *input/output scheme* is only sketched by the *flow directions* for the matrices . The necessary details to understand the input/output operations are now provided by Figure 16.6.

The **input/output scheme** in Figure 16.6 shows some new phenomena when compared with Figure 16.1(a). The input and output cells belonging to any matrix are no longer threaded all on a single straight line; now, for each matrix, they lie along two adjacent *borders*, that additionally may differ in the number of links to the outside world. The data structures from Figure 16.6 also differ from that in Figure 16.1(a) in the angle of inclination. Moreover, the matrices and from Figure 16.6 arrive at the *boundary cells* with only one third of the *data rate*, compared to Figure 16.1(a).

Spending some effort, even here it might be possible in principle to construct—item by item—the appropriate input/output scheme fitting the present systolic array. But it is much more safe to apply a formal derivation. The following subsections are devoted to the presentation of the various methodical steps for achieving our goal.

First, we need to construct a formal relation between the abstract data structures and the concrete variable instances in the assignment-free representation.

Each item of the matrix can be characterised by a row index and a column index . These **data structure indices** can be comprised in a **data structure vector**
. Item in system (16.3) corresponds to the instances , with any . The coordinates of these instances all lie on a line along direction in space . Thus, in this case, the formal change from data structure vector to coordinates can be described by the transformation

In system (16.3), the coordinate vector of every variable instance equals the iteration vector of the domain point representing the calculation of this variable instance. Thus we also may interpret formula (16.23) as a relation between *data structure vectors* and *iteration vectors*. Abstractly, the desired iteration vectors can be inferred from the data structure vector by the formula

The affine vector is necessary in more general cases, though always null in our example.

Because of , the representation for matrix correspondingly is

Concerning matrix , each variable instance may denote a different value. Nevertheless, all instances to a fixed index pair can be regarded as belonging to the same matrix item , since they all stem from the serialisation of the sum operator for the calculation of . Thus, for matrix , following formula (16.24) we may set

Each of the three matrices is generated by two directions with regard to the *data structure indices*: along a row, and along a column. The difference vector (0,1) thereby describes a move from an item to the next item of the same row, i.e., in the next column: . Correspondingly, the difference vector (1,0) stands for sliding from an item to the next item in the same column and next row: .

*Input/output schemes* of the appearance shown in Figures 16.1(a) and 16.6 denote **snapshots:** all positions of data items depicted, with respect to the entire systolic array, are related to a common timestep.

As we can notice from Figure 16.6, the rectangular shapes of the abstract data structures are mapped to parallelograms in the snapshot, due to the linearity of the applied space-time transformation. These parallelograms can be described by difference vectors along their borders, too.

Next we will translate difference vectors from data structure vectors into spatial difference vectors for the snapshot. Therefore, by choosing the parameter in formula (16.24), we pick a pair of iteration vectors that are mapped to the same timestep under our space-time transformation. For the moment it is not important which concrete timestep we thereby get. Thus, we set up

implying

and thus

Due to the linearity of all used transformations, the wanted spatial difference vector hence follows from the difference vector of the data structure as

or

With the aid of formula (16.31), we now can determine the spatial difference vectors for matrix . As mentioned above, we have

Noting , we get

For the rows, we have the difference vector , yielding the spatial difference vector . Correspondingly, from for the columns we get . If we check with Figure 16.6, we see that the rows of in fact run along the vector , the columns along the vector .

Similarly, we get for the rows of , and for the columns of ; as well as for the rows of , and for the columns of .

Applying these instruments, we are now able to reliably generate appropriate input/output schemes—although separately for each matrix at the moment.

Now, the shapes of the matrices for the snapshot have been fixed. But we still have to adjust the matrices relative to the systolic array—and thus, also relative to each other. Fortunately, there is a simple graphical method for doing the task.

We first choose an arbitrary iteration vector, say . The latter we map with the projection matrix to the cell where the calculation takes place,

The iteration vector (1,1,1) represents the calculations , , and ; these in turn correspond to the data items , , and . We now lay the input/output schemes for the matrices on the systolic array in a way that the entries , , and all are located in cell .

In principle, we would be done now. Unfortunately, our input/output schemes overlap with the cells of the systolic array, and are therefore not easily perceivable. Thus, we simultaneously retract the input/output schemes of all matrices in counter flow direction, place by place, until there is no more overlapping. With this method, we get exactly the input/output scheme from Figure 16.6.

As an alternative to this nice graphical method, we also could formally calculate an overlap-free placement of the various input/output schemes.

Only after specifying the input/output schemes, we can correctly calculate the number of timesteps effectively needed. The first relevant timestep starts with the first input operation. The last relevant timestep ends with the last output of a result. For the example, we determine from Figure 16.6 the beginning of the calculation with the input of the data item in timestep 0, and the end of the calculation after output of the result in timestep 14. Altogether, we identify 15 timesteps—five more than with pure treatment of the real calculations.

The input schemes of the matrices and from Figure 16.1(a) have a dense layout: if we drew the borders of the matrices shown in the figure, there would be no spare places comprised.

Not so in Figure 16.6. In any input data stream, each data item is followed by two spare places there. For the input matrices this means: the *boundary cells* of the systolic array receive a proper data item only every third timestep.

This property is a direct result of the employed space-time transformation. In both examples, the abstract data structures themselves are dense. But how close the various items really come in the input/output scheme depends on the absolute value of the determinant of the transformation matrix : in every input/output data stream, the proper items follow each other with a spacing of exactly places. Indeed for Figure 16.1; as for Figure 16.6, we now can rate the fluffy spacing as a practical consequence of .

What to do with spare places as those in Figure 16.6? Although each cell of the systolic array from Figure 16.3 in fact does useful work only every third timestep, it would be nonsense to pause during two out of three timesteps. Strictly speaking, we can argue that values on places marked with dots in Figure 16.6 have no influence on the calculation of the shown items , because they never reach an active cell at time of the calculation of a variable . Thus, we may simply fill spare places with any value, no danger of disturbing the result. It is even feasible to execute three different matrix products at the same time on the systolic array from Figure 16.3, without interference. This will be our topic in Subsection 16.3.7.

When further studying Figure 16.6, we can identify another problem. Check, for example, the itinerary of through the cells of the systolic array. According to the space-time transformation, the calculations contributing to the value of happen in the cells , , , and . But the input/output scheme from Figure 16.6 tells us that also passes through cell before, and eventually visits cell , too.

This may be interpreted as some **spurious calculations** being introduced into the system (16.3) by the used space-time transformation, here, for example, at the new domain points (2,2,0) and (2,2,5). The reason for this phenomenon is that the *domains of the input/output operations* are not in parallel to the chosen *projection direction*. Thus, some input/output operations are projected onto cells that do not belong to the **boundary** of the systolic array. But in the interior of the systolic array, no input/output operation can be performed directly. The problem can be solved by extending the trajectory, in flow or counter flow direction, from these inner cells up to the boundary of the systolic array. But thereby we introduce some new calculations, and possibly also some new domain points. This technique is called **input/output expansion**.

We must avoid that the additional calculations taking place in the cells (-2,0) and (3,0) corrupt the correct value of . For the matrix product, this is quite easy—though the general case is more difficult. The generic sum operator has a neutral element, namely zero. Thus, if we can guarantee that by new calculations only zero is added, there will be no harm. All we have to do is providing always at least one zero operand to any spurious multiplication; this can be achieved by filling appropriate input slots with zero items.

Figure 16.7 shows an example of a properly extended input/output scheme. Preceding and following the items of matrix , the necessary zero items have been filled in. Since the entered zeroes count like data items, the input/output scheme from Figure 16.6 has been retracted again by one place. The calculation now begins already in timestep , but ends as before with timestep 14. Thus we need 16 timesteps altogether.

Let us come back to the example from Figure 16.1(a). For inputting the items of matrices and , no expansion is required, since these items are always used in *boundary cells* first. But not so with matrix ! The items of are calculated in stationary variables, hence always in the same cell. Thus most results are produced in inner cells of the systolic array, from where they have to be moved—in a separate action—to *boundary cells* of the systolic array.

Although this new challenge, on the face of it, appears very similar to the problem from Subsection 16.3.5, and thus very easy to solve, in fact we here have a completely different situation. It is not sufficient to extend existing data flows forward or backward up to the boundary of the systolic array. Since for stationary variables the dependence vector is projected to the null vector, which constitutes no extensible direction, there can be no spatial flow induced by this dependency. Possibly, we can construct some auxiliary extraction paths, but usually there are many degrees of freedom. Moreover, we then need a *control mechanism* inside the cells. For all these reasons, the problem is further dwelled on in Section 16.4.

As can be easily noticed, the *utilisation* of the systolic array from Figure 16.3 with input/output scheme from Figure 16.7 is quite poor. Even without any deeper study of the starting phase and the closing phase, we cannot ignore that the average utilisation of the array is below one third—after all, each cell at most in every third timestep makes a proper contribution to the calculation.

A simple technique to improve this behaviour is to **interleave** calculations. If we have three independent matrix products, we can successively input their respective data, delayed by only one timestep, without any changes to the systolic array or its cells. Figure 16.8 shows a snapshot of the systolic array, with parts of the corresponding input/output scheme. Now we must check by a formal derivation whether this idea is really working. Therefore, we slightly modify system (16.3). We augment the variables and the domains by a fourth dimension, needed to distinguish the three matrix products:

**Figure 16.8. Interleaved calculation of three matrix products on the systolic array from Figure 16.3.**

Obviously, in system (16.32), problems with different values of are not related. Now we must preserve this property in the systolic array. A suitable *space-time matrix* would be

Notice that is not square here. But for calculating the space coordinates, the fourth dimension of the iteration vector is completely irrelevant, and thus can simply be neutralised by corresponding zero entries in the fourth column of the first and second rows of .

The last row of again constitutes the *time vector*
. Appropriate choice of embeds the three problems to solve into the space-time continuum, avoiding any intersection. Corresponding instances of the iteration vectors of the three problems are projected to the same cell with a respective spacing of one timestep, because the fourth entry of equals 1.

Finally, we calculate the average *utilisation*—with or without interleaving—for the concrete problem parameters , , and . For a single matrix product, we have to perform calculations, considering a multiplication and a corresponding addition as a compound operation, i.e., counting both together as only one calculation; input/output operations are not counted at all. The systolic array has 36 cells.

Without interleaving, our systolic array altogether takes 16 timesteps for calculating a single matrix product, resulting in an average utilisation of calculations per timestep and cell. When applying the described interleaving technique, the calculation of all three matrix products needs only two timesteps more, i.e., 18 timesteps altogether. But the number of calculations performed thereby has tripled, so we get an average utilisation of the cells amounting to calculations per timestep and cell. Thus, by interleaving, we were able to improve the utilisation of the cells to 267 per cent!

**Exercises**

16.3-1 From equation (16.31), formally derive the spatial difference vectors of matrices and for the input/output scheme shown in Figure 16.6.

16.3-2 Augmenting Figure 16.6, draw an extended input/output scheme that forces both operands of all spurious multiplications to zero.

16.3-3 Apply the techniques presented in Section 16.3 to the systolic array from Figure 16.1.

16.3-4 Proof the properties claimed in Subsection 16.3.7 for the special space-time transformation (16.33) with respect to system (16.32).