So far we have assumed that each cell of a systolic array behaves in completely the same way during every timestep. Admittedly there are some relevant examples of such systolic arrays. However, in general the cells successively have to work in several **operation modes**, switched to by some control mechanism. In the sequel, we study some typical situations for exerting control.

The cell from Figure 16.3(b) contains the registers *A*, *B*, and *C*, that—when activated by the global clock signal—accept the data applied to their inputs and then reliably reproduce these values at their outputs for one clock cycle. Apart from this system-wide activity, the function calculated by the cell is invariant for all timesteps: a *fused multiply-add* operation is applied to the three input operands , , and , with result passed to a neighbouring cell; during the same cycle, the operands and are also forwarded to two other neighbouring cells. So in this case, the cell needs *no control* at all.

The *initial values*
for the execution of the generic sum operator—which could also be different from zero here—are provided to the systolic array via the *input streams*, see Figure 16.7; the *final results*
continue to flow into the same direction up to the boundary of the array. Therefore, the input/output activities for the cell from Figure 16.3(b) constitute an intrinsic part of the normal cell function. The price to pay for this extremely simple cell function without any control is a restriction in all three dimensions of the matrices: on a systolic array like that from Figure 16.3, with *fixed* array parameters , an matrix can only be multiplied by an matrix if the relations , , and hold.

In this respect, constraints for the array from Figure 16.1 are not so restrictive: though the problem parameters and also are bounded by and , there is no constraint for . Problem parameters unconstrained in spite of fixed array parameters can only emerge in time but not in space, thus mandating the use of *stationary variables*.

Before a new calculation can start, each register assigned to a stationary variable has to be *reset* to an initial state independent from the previously performed calculations. For instance, concerning the systolic cell from Figure 16.3(b), this should be the case for register *C*. By a *global* signal similar to the clock, register *C* can be cleared in all cells at the same time, i.e., reset to a zero value. To prevent a corruption of the reset by the current values of *A* or *B*, at least one of the registers *A* or *B* must be *cleared* at the same time, too. Figure 16.9 shows an array structure and a cell structure implementing this idea.

Unfortunately, for the matrix product the principle of the global control is not sufficient without further measures. Since the systolic array presented in Figure 16.1 even lacks another essential property: the results are not passed to the boundary but stay in the cells.

At first sight, it seems quite simple to forward the results to the boundary: when the calculation of an item is finished, the links from cell to the neighbouring cells and are no longer needed to forward items of the matrices and . These links can be reused then for any other purpose. For example, we could pass all items of through the downward-directed links to the lower border of the systolic array.

But it turns out that leading through results from the upper cells is hampered by ongoing calculations in the lower parts of the array. If the result , finished in timestep , would be passed to cell in the next timestep, a conflict would be introduced between two values: since only one value per timestep can be sent from cell via the lower port, we would be forced to keep either or , the result currently finished in cell . This effect would spread over all cells down.

To fix the problem, we could **slow down** the forwarding of items . If it would take two timesteps for to pass a cell, no collisions could occur. Then, the results stage a procession through the same link, each separated from the next by one timestep. From the lower boundary cell of a column, the host computer first receives the result of the bottom row, then that of the penultimate row; this procedure continues until eventually we see the result of the top row. Thus we get the output scheme shown in Figure 16.10.

How can a cell recognise when to change from forwarding items of matrix to passing items of matrix through the lower port? We can solve this task by an automaton combining global control with local control in the cell:

If we send a global signal to all cells at exactly the moment when the last items of and are input to cell , each cell can start a countdown process: in each successive timestep, we decrement a counter initially set to the number of the remaining calculation steps. Thereby cell still has to perform calculations before changing to **propagation mode**. Later, the already mentioned global reset signal switches the cell back to **calculation mode**.

Figure 16.11 presents a systolic array implementing this local/global principle. Basically, the array structure and the communication topology have been preserved. But each cell can run in one of two states now, switched by a *control logic*:

In

*calculation mode*, as before, the result of the addition is written to register*C*. At the same time, the value in register*B*—i.e., the operand used for the multiplication—is forwarded through the lower port of the cell.In

*propagation mode*, registers*B*and*C*are connected in series. In this mode, the only function of the cell is to guide each value received at the upper port down to the lower port, thereby enforcing a*delay*of two timesteps.

The first value output from cell in *propagation mode* is the currently calculated value , stored in register *C*. All further output values are results forwarded from cells above. A formal description of the algorithm implemented in Figure 16.11 is given by the assignment-free system (16.34).

It rests to explain how the control signals in a cell are generated in this model. As a prerequisite, the cell must contain a **state flip-flop** indicating the current *operation mode*. The output of this flip-flop is connected to the control inputs of both multiplexors, see Figure 16.11(b). The global reset signal clears the state flip-flop, as well as the registers *A* and *C*: the cell now works in *calculation mode*.

The global ready signal starts the countdown in all cells, so in every timestep the counter is diminished by 1. The counter is initially set to the precalculated value , dependent on the position of the cell. When the counter reaches zero, the flip-flop is set: the cell switches to *propagation mode*.

If desisting from a direct reset of the register *C*, the last value passed, before the reset, from register *B* to register *C* of a cell can be used as a freely decidable initial value for the next dot product to evaluate in the cell. We then even calculate, as already in the systolic array from Figure 16.3, the more general problem

detailed by the following equation system:

The method sketched in Figure 16.11 still has the following drawbacks:

The systolic array uses global control signals, requiring a high technical accuracy.

Each cell needs a counter with counting register, introducing a considerable hardware expense.

The initial value of the counter varies between the cells. Thus, each cell must be individually designed and implemented.

The input data of any successive problem must wait outside the cells until all results from the current problem have left the systolic array.

These disadvantages can be avoided, if *control signals* are *propagated like data*—meaning a **distributed control**. Therefore, we preserve the connections of the registers *B* and *C* with the multiplexors from Figure 16.11(b), but do not generate any control signals in the cells; also, there will be no global reset signal. Instead, a cell receives the necessary control signal from one of the neighbours, stores it in a new one-bit register *S*, and appropriately forwards it to further neighbouring cells. The primary control signals are generated by the *host computer*, and infused into the systolic array by *boundary cells*, only. Figure 16.12(a) shows the required array structure, Figure 16.12(b) the modified cell structure.

Switching to the *propagation mode* occurs successively down one cell in a column, always delayed by one timestep. The delay introduced by register *S* is therefore sufficient.

Reset to the *calculation mode* is performed via the same control wire, and thus also happens with a delay of one timestep per cell. But since the results sink down at half speed, only, we have to wait sufficiently long with the reset: if a cell is switched to *calculation mode* in timestep , it goes to *propagation mode* in timestep , and is reset back to *calculation mode* in timestep .

So we learned that in a systolic array, distributed control induces a different macroscopic timing behaviour than local/global control. Whereas the systolic array from Figure 16.12 can start the calculation of a new problem (16.35) every timesteps, the systolic array from Figure 16.11 must wait for timesteps. The time difference resp. is called the **period**, its reciprocal being the **throughput**.

System (16.37 and 16.38), divided into two parts during the typesetting, formally describes the relations between distributed control and calculations. We thereby assume an infinite, densely packed sequence of matrix product problems, the additional iteration variable being unbounded. The equation headed *variables with alias* describes but pure identity relations.

**Figure 16.12. Matrix product on a rectangular systolic array, with output of results and distributed control. (a) Array structure. (b) Cell structure.**

Formula (16.39) shows the corresponding space-time matrix. Note that one entry of is not constant but depends on the *problem parameters*:

Interestingly, also the cells in a row switch one timestep later when moving one position to the right. Sacrificing some regularity, we could use this circumstance to relieve the *host computer* by applying control to the systolic array at cell (1,1), only. We therefore would have to change the control scheme in the following way:

**Figure 16.13. Matrix product on a rectangular systolic array, with output of results and distributed control. (a) Array structure. (b) Cell on the upper border. **

Figure 16.13 shows the result of this modification. We now need cells of two kinds: cells on the upper border of the systolic array must be like that in Figure 16.13(b); all other cells would be as before, see Figure 16.13(c). Moreover, the *communication topology* on the upper border of the systolic array would be slightly different from that in the regular area.

The chosen *space-time transformation* widely determines the architecture of the systolic array. Mapping recurrence equations to space-time coordinates yields an explicit view to the *geometric properties* of the systolic array, but gives no real insight into the *function of the cells*. In contrast, the processes performed inside a cell can be directly expressed by a **cell program**. This approach is particularly of interest if dealing with a *programmable systolic array*, consisting of cells indeed controlled by a repetitive program.

Like the **global view**, i.e., the *structure* of the systolic array, the **local view** given by a *cell program* in fact is already fixed by the space-time transformation. But, this local view is only induced implicitly here, and thus, by a further mathematical transformation, an explicit representation must be extracted, suitable as a cell program.

In general, we denote instances of program variables with the aid of **index expressions**, that refer to iteration variables. Take, for instance, the equation

from system (16.3). The instance of the program variable is specified using the index expressions , , and , which can be regarded as functions of the iteration variables .

As we have noticed, the set of iteration vectors from the quantification becomes a set of space-time coordinates when applying a space-time transformation (16.12) with transformation matrix from (16.14),

Since each cell is denoted by space coordinates , and the cell program must refer to the current time , the iteration variables in the index expressions for the program variables are not suitable, and must be translated into the new coordinates . Therefore, using the inverse of the space-time transformation from (16.41), we express the iteration variables as functions of the space-time coordinates ,

The existence of such an inverse transformation is guaranteed if the space-time transformation is injective on the domain—and that it should always be: if not, some instances must be calculated by a cell in the same timestep. In the example, reversibility is guaranteed by the square, non singular matrix , even without referral to the domain. With respect to the time vector and any projection vector , the property is sufficient.

Replacing iteration variables by space-time coordinates, which might be interpreted as a **transformation of the domain**, frequently yields very unpleasant index expressions. Here, for example, from we get

But, by a successive **transformation of the index sets**, we can relabel the instances of the program variables such that the reference to cell and time appears more evident. In particular, it seems worthwhile to transform the equation system back into **output normal form**, i.e., to denote the results calculated during timestep in cell by instances of the program variables. We best gain a real understanding of this approach via an abstract mathematical formalism, that we can fit to our special situation.

Therefore, let

be a quantified equation over a domain , with program variables and . The **index functions**
and generate the instances of the program variables as tuples of index expressions.

By transforming the domain with a function that is injective on , equation (16.43) becomes

where is a function that constitutes an inverse of on . The new index functions are and . Transformations of index sets don't touch the domain; they can be applied to each program variable separately, since only the instances of this program variable are renamed, and in a consistent way. With such renamings and , equation (16.44) becomes

If output normal form is desired, has to be the identity.

In the most simple case (as for our example), is the identity, and is an affine transformation of the form , with constant —the already known *dependence vector*. then can be represented in the same way, with . Transformation of the domains happens by the space-time transformation , with an invertible matrix . For all index transformations, we choose the same . Thus equation (16.45) becomes

For the generation of a *cell program*, we have to know the following information for every timestep: the operation to perform, the source of the data, and the destination of the results—known from assembler programs as `opc`

, `src`

, `dst`

.

The operation to perform (`opc`

) follows directly from the function . For a cell with control, we must also find the timesteps when to perform this individual function . The set of these timesteps, as a function of the space coordinates, can be determined by projecting the set onto the time axis; for general polyhedric with the aid of a *Fourier-Motzkin elimination*, for example.

In system (16.46), we get a new dependence vector , consisting of two components: a (vectorial) spatial part, and a (scalar) timely part. The *spatial* part , as a difference vector, specifies *which* neighbouring cell has calculated the operand. We directly can translate this information, concerning the input of operands to cell , into a port specifier with port position , serving as the `src`

operand of the instruction. In the same way, the cell calculating the operand, with position , must write this value to a port with port position , used as the `dst`

operand in the instruction.

The *timely* part of specifies, as a time difference , *when* the calculation of the operand has been performed. If , this information is irrelevant, because the reading cell always gets the output of the immediately preceding timestep from neighbouring cells. However, for , the value must be buffered for timesteps, either by the *producer* cell , or by the *consumer* cell —or by both, sharing the burden. This need can be realised in the cell program, for example, with copy instructions executed by the producer cell , preserving the value of the operand until its final output from the cell by passing it through registers.

Applying this method to system (16.37 and 16.38), with transformation matrix as in (16.39), yields

The iteration variable l, being relevant only for the input/output scheme, can be set to a fixed value prior to the transformation. The cell program for the systolic array from Figure 16.12, performed once in every timestep, reads as follows:

*
Cell-Program
*

1 2 3 4 5 67`IF`

8 9`THEN`

10`ELSE`

The port specifiers stand for local input/output to/from the cell. For each, a pair of qualifiers is derived from the geometric position of the ports relative to the centre of the cell. Port is situated on the left border of the cell, on the right border; is above the centre, below. Each port specifier can be augmented by a bit range: stands for bit 0 of the port, only; denotes the bits 1 to . The designations without port qualifiers stand for registers of the cell.

By application of matrix from (16.13) to system (16.36), we get

Now the advantages of distributed control become obvious. The cell program for (16.47) can be written with referral to the respective timestep , only. And thus, we need no reaction to global control signals, no counting register, no counting operations, and no coding of the local cell coordinates.

**Exercises**

16.4-1 Specify appropriate input/output schemes for performing, on the systolic arrays presented in Figures 16.11 and 16.12, two evaluations of system (16.36) that follow each other closest in time.

16.4-2 How could we change the systolic array from Figure 16.12, to efficiently support the calculation of matrix products with parameters or ?

16.4-3 Write a cell program for the systolic array from Figure 16.3.

16.4-4 Which throughput allows the systolic array from Figure 16.3 for the assumed values of ? Which for general ?

16.4-5 Modify the systolic array from Figure 16.1 such that the results stored in stationary variables are output through additional links directed half right down, i.e., from cell to cell . Develop an assignment-free equation system functionally equivalent to system (16.36), that is compatible with the extended structure. How looks the resulting input/output scheme? Which period is obtained?