## 16.5. Linear systolic arrays

Figure 16.14. Bubble sort algorithm on a linear systolic array: (a) array structure with input/output scheme; (b) cell structure.

The explanations in the sections above focused heavily on two-dimensional systolic arrays, but in principle they also apply to one-dimensional systolic arrays, called linear systolic arrays in the sequel. The most relevant difference between the two kinds concerns the boundary of the systolic array. Linear systolic arrays can be regarded as consisting of boundary cells only; under this assumption, input from and output to the host computer needs no special attention. However, the geometry of a linear systolic array provides one full dimension as well as one fictitious dimension, so communication along the full-dimensional axis may raise questions similar to those in Subsection 16.3.5. Finally, the boundary of a linear systolic array can also be defined in a radically different way, namely as consisting of the two end cells only.

### 16.5.1. Matrix-vector product

If we set one of the problem parameters to the value 1 for a systolic array like that from Figure 16.1, the matrix product amounts to multiplying a matrix by a vector, from the left or from the right. The two-dimensional systolic array then degenerates to a one-dimensional one. The vector to multiply by is provided as an input data stream through an end cell of the linear systolic array. The matrix items are input to the array simultaneously, using the complete broadside.

As with the full matrix product, the results emerge stationary. But now they can either be drained along the array to one of the end cells, or sent directly from the producer cells to the host computer. The two methods result in different control mechanisms, time schemes, and running times.
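To make the data flow concrete, the following Python simulation sketches this design under stated assumptions: an N-cell array with stationary results, the vector streaming in through the left end cell one item per timestep, and the matrix items fed broadside. The function name and the exact timing scheme are illustrative, not taken from the text.

```python
def systolic_matvec(a, x):
    """Simulate a linear systolic array computing y = A * x.

    Cell k keeps a stationary accumulator y[k]; the vector items
    stream in through the left end cell, one per timestep, while
    matrix item a[k][j] is fed broadside to cell k exactly when
    x[j] passes through (timestep j + k)."""
    n = len(a)
    y = [0] * n                 # stationary partial results, one per cell
    pipe = [None] * n           # the x value currently held in each cell
    for t in range(2 * n - 1):  # enough timesteps to flush the pipeline
        # shift the vector stream one cell to the right
        pipe = [x[t] if t < n else None] + pipe[:-1]
        for k in range(n):
            j = t - k           # index of the x item now in cell k
            if pipe[k] is not None and 0 <= j < n:
                y[k] += a[k][j] * pipe[k]   # fused multiply/add
    return y
```

Running `systolic_matvec([[1, 2], [3, 4]], [5, 6])` yields `[17, 39]`, matching the ordinary matrix-vector product.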

Now, would it be possible to provide all inputs via the end cells? The answer is negative if the running time is to be of complexity Θ(N). The matrix contains N² items, so there are N items to read per timestep. But the number of items receivable through an end cell during one timestep is bounded. Thus the input/output data rate, of order N here, may already constrain the possible design space.

### 16.5.2. Sorting algorithms

For sorting, the task is to bring the N elements of a set, drawn from a totally ordered basic set, into ascending order. A solution to this problem is described by an assignment-free equation system in which the constant ∞ denotes the maximum of the basic set. By completing a projection of this system to a space-time transformation, we get the linear systolic array from Figure 16.14 as an implementation of the bubble sort algorithm.
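As a functional sketch of such a sorting array: each cell keeps the smallest value it has seen so far and forwards the larger one to its right neighbour. The helper name and the use of `float('inf')` as the neutral maximum of the basic set are assumptions for illustration, and the sequential inner loop reproduces the data flow, not the cycle-accurate pipelined timing of the real array.

```python
def systolic_sort(items):
    """Functional sketch of sorting on a linear systolic array.

    Each cell stores the smallest value seen so far and passes the
    larger one to its right neighbour; after all items have entered
    through the left end cell, cell k holds the (k+1)-smallest item.
    (A real array pipelines the ripple, one hop per timestep.)"""
    n = len(items)
    cell = [float('inf')] * n     # stationary registers, preset to the maximum
    for v in items:               # one item enters the left end cell per step
        for k in range(n):
            if v < cell[k]:
                cell[k], v = v, cell[k]   # keep the smaller, pass on the larger
    return cell
```

For example, `systolic_sort([3, 1, 2])` returns `[1, 2, 3]`.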

Correspondingly, another space-time matrix would induce a linear systolic array that implements insertion sort. Yet another space-time matrix would lead to a linear systolic array for selection sort.

For the sorting problem, we have N input items, N output items, and Θ(N) timesteps. This results in an input/output data rate of order Θ(1). In contrast to the matrix-vector product from Subsection 16.5.1, the sorting problem therefore in principle allows the communication to be performed exclusively through the end cells of a linear systolic array.

Note that, in all three variants of sorting described so far, direct input to all cells is necessary: the values to be ordered for bubble sort, the constant values ∞ for insertion sort, and both for selection sort. However, instead of inputting the constants, the cells could generate them, or read them from a local memory.

All three variants require cell control: insertion sort and selection sort use stationary variables; bubble sort has to switch between processing input data and outputting calculated values.

### 16.5.3. Lower triangular linear equation systems

System (16.53) below describes a localised algorithm for solving the linear equation system Ax = b, where the matrix A is a lower triangular matrix. All previous examples had in common that, apart from copy operations, the same kind of calculation was performed at each domain point: fused multiply/add for the matrix algorithms, minimum and maximum for the sorting algorithms. In contrast, system (16.53) contains some domain points where a multiply and subtract is required, as well as others needing a division. When projecting system (16.53) to a linear systolic array, we get fixed or varying cell functions depending on the chosen projection direction. For one particular projection direction, we see a single cell with a divider; all other cells need a multiply/subtract unit. Two other projection directions yield identical cells, each containing a divider as well as a multiply/subtract unit. Yet another projection vector results in a linear systolic array with three different cell types: the two end cells need only a divider; all other cells contain a multiply/subtract unit, alternatingly with or without a divider. Thus a certain projection can introduce inhomogeneities into a systolic array, which may or may not be desirable.
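Algorithmically, what the array computes is forward substitution. A plain Python sketch, written so that the two cell functions remain visible (the function name is assumed for illustration):

```python
def triangular_solve(a, b):
    """Forward substitution for A x = b with lower triangular A.

    The inner loop mirrors the multiply/subtract cell function of
    the systolic array; the final division mirrors the divider cell."""
    n = len(b)
    x = []
    for i in range(n):
        acc = b[i]
        for j in range(i):
            acc -= a[i][j] * x[j]   # multiply/subtract cell function
        x.append(acc / a[i][i])     # divider cell function
    return x
```

For instance, `triangular_solve([[2.0, 0.0], [1.0, 4.0]], [2.0, 9.0])` yields `[1.0, 2.0]`.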

### Exercises

16.5-1 For both variants of the matrix-vector product from Subsection 16.5.1 (output of the results through an end cell versus communication by all cells), specify a suitable array structure with input/output scheme and cell structure, including the necessary control mechanisms.

16.5-2 Study the effects of further projection directions on system (16.53).

16.5-3 Construct systolic arrays implementing insertion sort and selection sort, as mentioned in Subsection 16.5.2. Also draw the corresponding cell structures.

16.5-4 The systolic array for bubble sort from Figure 16.14 could be operated without control by cleverly organising the input streams. Can you find the trick?

16.5-5 What purpose does the value ∞ serve in system (16.49)? How could system (16.49) be formulated without this constant value? What consequences would this incur for the systolic arrays described?

## Problems

16-1 Band matrix algorithms

In Sections 16.1 and 16.2 and in Subsections 16.5.1 and 16.5.3, we always assumed full input matrices, i.e., each matrix item used could in principle be nonzero. (Though in a lower triangular matrix, the items above the main diagonal are all zero. Note, however, that these items are not inputs to any of the algorithms described.)

In contrast, practical problems frequently involve band matrices, cf. Kung/Leiserson []. In such a matrix, most diagonals are zero, apart from a small band around the main diagonal. Formally, we have a(i,j) = 0 for all i, j with j - i >= w1 or i - j >= w2, where w1 and w2 are positive integers. The band width, i.e., the number of diagonals where nonzero items may appear, here amounts to w1 + w2 - 1.
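The band condition can be stated compactly in code; the parameter names `w1` and `w2` for the two band bounds are illustrative choices:

```python
def is_band_matrix(a, w1, w2):
    """Check the band condition for a square matrix a: every item
    with j - i >= w1 or i - j >= w2 must be zero, so at most
    w1 + w2 - 1 diagonals may hold nonzero items."""
    n = len(a)
    return all(a[i][j] == 0
               for i in range(n)
               for j in range(n)
               if j - i >= w1 or i - j >= w2)
```

A tridiagonal matrix, for example, satisfies the condition with w1 = w2 = 2, giving a band width of 3.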

Now the question arises whether we can profit from the band structure of one or more input matrices to optimise the systolic calculation. One opportunity would be to delete cells doing no useful work. Other benefits could be shorter input/output data streams, reduced running time, or higher throughput.

Study all systolic arrays presented in this chapter for improvements with respect to these criteria.

## Chapter notes

The term systolic array was coined by Kung and Leiserson in their seminal paper [].

Karp, Miller, and Winograd did pioneering work on uniform recurrence equations [].

Essential stimuli for a theory of the systematic design of systolic arrays were Rao's PhD dissertation [] and the work of Quinton [].

The contribution of Teich and Thiele [] shows that a formal derivation of the cell control can be achieved by methods very similar to those for a determination of the geometric array structure and the basic cell function.

The up-to-date book by Darte, Robert, and Vivien [] combines advanced methods from compiler design and systolic array design, also dealing with the analysis of data dependences.

The monograph [] still seems to be the most comprehensive work on systolic systems.

Each systolic array can also be modelled as a cellular automaton. The registers in a cell together hold the state of the cell. Thus, a factorised state space is adequate. Cells of different kind, for instance with varying cell functionality or position-dependent cell control, can be described with the aid of further components of the state space.

Each systolic algorithm can also be regarded as a PRAM algorithm with the same timing behaviour. Here, each register in a systolic cell corresponds to a PRAM memory cell, and vice versa. The EREW PRAM model is sufficient, because in every timestep exactly one systolic cell reads from each register, and then exactly one systolic cell writes to it.

Each systolic system is also a special kind of synchronous network as defined by Lynch []. The time complexity measures agree. Communication complexity is usually not an issue for systolic arrays. The restriction to input/output through boundary cells, frequently demanded of systolic arrays, can also be modelled in a synchronous network. The concept of failures is not required for systolic arrays.

The book by Sima, Fountain, and Kacsuk [] considers systolic systems in detail.