Systolic arrays probably constitute a perfect kind of special-purpose computer. In their simplest form, they may provide only one operation, repeated over and over again. Yet systolic arrays show an abundance of practice-oriented applications, mainly in fields dominated by iterative procedures: numerical mathematics, combinatorial optimisation, linear algebra, algorithmic graph theory, image and signal processing, speech and text processing, et cetera.
This is possible because a systolic array can be tailored so accurately to the structure of its one and only algorithm: the time and place of each executed operation are fixed once and for all, and communicating cells are permanently and directly connected, with no switching required. The algorithm has in fact become hardwired; in this respect, systolic algorithms are considered to be hardware algorithms. Please note that the term systolic algorithm usually does not refer to a set of concrete algorithms for solving a single specific computational problem, as for instance sorting; this is quite in contrast to terms like sorting algorithm. Rather, systolic algorithms constitute a special style of specification, programming, and computation, so algorithms from many different areas of application can be systolic in style. However, not every well-known algorithm from such an area is suited to systolic computation.
Hence, this chapter does not intend to present all systolic algorithms, nor even the most important systolic algorithms from any field of application. Instead, with a few simple but typical examples, we try to lay the foundations for the reader's general understanding of systolic algorithms. The rest of this chapter is organised as follows: Section 16.1 introduces some basic concepts of systolic systems by means of an introductory example. Section 16.2 explains how systolic arrays formally emerge from space-time transformations. Section 16.3 deals with input/output schemes. Section 16.4 is devoted to all aspects of control in systolic arrays. In Section 16.5 we study the class of linear systolic arrays, raising further questions.
The designation systolic follows from the operational principle of the systolic architecture. The systolic style is characterised by an intensive application of both pipelining and parallelism, controlled by a global and completely synchronous clock. Data streams pulsate rhythmically through the communication network, just as blood is driven from the heart through the veins of the body. Here, pipelining is not constrained to a single space axis but concerns all data streams, possibly moving in different directions and intersecting in the cells of the systolic array.
A systolic system typically consists of a host computer and the actual systolic array. Conceptually, the host computer is of minor importance, just controlling the operation of the systolic array and supplying the data. The systolic array can be understood as a specialised network of cells rapidly performing data-intensive computations, supported by massive parallelism. A systolic algorithm is the program collaboratively executed by the cells of a systolic array.
Systolic arrays may appear very differently, but usually share a couple of key features: discrete time scheme, synchronous operation, regular (frequently two-dimensional) geometric layout, communication limited to directly neighbouring cells, and spartan control mechanisms.
In this section, we explain fundamental phenomena in the context of systolic arrays, driven by a running example. A computational problem usually allows several solutions, each implemented by a specific systolic array. Among these, the most attractive designs (in whatever respect) may be very complex. Note, however, that in this educational text we are less interested in advanced solutions, but strive to present important concepts compactly and intuitively.
Figure 16.1 shows a rectangular systolic array consisting of 15 cells for multiplying a $3 \times N_3$ matrix $A$ by an $N_3 \times 5$ matrix $B$. The parameter $N_3$ is not reflected in the structure of this particular systolic array, but in the input scheme and the running time of the algorithm.
The input scheme depicted is based on a special choice of the parameter $N_3$. Therefore, Figure 16.1 gives the solution to one particular problem instance.
The cells of the systolic array can exchange data through links, drawn as arrows between the cells in Figure 16.1(a). Boundary cells of the systolic array can also communicate with the outside world. All cells of the systolic array share a common connection pattern for communicating with their environment. The completely regular structure of the systolic array (placement and connection pattern of the cells) induces regular data flows along all connecting directions.
Figure 16.1(b) shows the internal structure of a cell. We find a multiplier, an adder, three registers, and four ports, plus some wiring between these units. Each port represents an interface to some external link that is attached to the cell. All our cells are of the same structure.
Each of the registers A, B, C can store a single data item. The designations of the registers are suggestive here, but arbitrary in principle. Registers A and B get their values from input ports, shown in Figure 16.1(b) as small circles on the left and upper borders of the cell, respectively.
The current values of registers A and B are used as operands of the multiplier and, at the same time, are passed through output ports of the cell, see the circles on the right and lower borders. The result of the multiplication is supplied to the adder, with the second operand originating from register C. The result of the addition eventually overwrites the previous value of register C.
Figure 16.1. Rectangular systolic array for matrix product. (a) Array structure and input scheme. (b) Cell structure.
The 15 cells of the systolic array are organised as a rectangular pattern of three rows by five columns, exactly as with the result matrix $C$. These dimensions directly correspond to the number of rows of matrix $A$ and the number of columns of matrix $B$. The size of the systolic array therefore corresponds to the size of some data structures for the problem to solve. If we had to multiply an $N_1 \times N_3$ matrix by an $N_3 \times N_2$ matrix in the general case, then we would need a systolic array with $N_1$ rows and $N_2$ columns.
The quantities $N_1$, $N_2$, $N_3$ are parameters of the problem to solve, because the number of operations to perform depends on each of them; they are thus problem parameters. The size of the systolic array, in contrast, depends on the quantities $N_1$ and $N_2$ only. For this reason, $N_1$ and $N_2$ also become array parameters for this particular systolic array, whereas $N_3$ is not an array parameter.
Remark. For the matrix product, we will see another systolic array in Section 16.2, with dimensions dependent on all three problem parameters $N_1$, $N_2$, $N_3$.
A systolic array as shown in Figure 16.1 would also permit us to multiply an $M_1 \times M_3$ matrix by an $M_3 \times M_2$ matrix, where $M_1 \le 3$ and $M_2 \le 5$. This is important if we intend to use the same systolic array for the multiplication of matrices of varying dimensions. Then we would operate on a properly dimensioned rectangular subarray only, consisting of $M_1$ rows and $M_2$ columns, and located, for instance, in the upper left corner of the complete array. The remaining cells would also work, but without any contribution to the solution of the whole problem; they should do no harm, of course.
Now let us assume that we want to assign unique space coordinates to each cell of a systolic array, characterising the geometric position of the cell relative to the whole array. In a rectangular systolic array, we simply can use the respective row and column numbers, for instance. The upper left cell in Figure 16.1 thus would get the coordinates (1,1), its right neighbour the coordinates (1,2), the cell below the upper left cell the coordinates (2,1), and so on. For the remainder of this section, we take space coordinates constructed in such a way for granted.
In principle it does not matter where the coordinate origin lies, which way the axes point, which direction in space corresponds to the first coordinate, and which to the second. In the system presented above, the order of the coordinates has been chosen to correspond to the designation of the matrix components. Thus, the first coordinate stands for the rows, numbered top to bottom from position 1; the second coordinate stands for the columns, numbered left to right, also from position 1.
Of course, we could have made a completely different choice for the coordinate system. But the presented system perfectly matches our particular systolic array: the indices of a matrix element $c_{ij}$ computed in a cell agree with the coordinates of this cell. The rows of the matrix $A$ entered into the array carry the same number as the first coordinate of the cells they pass; correspondingly for the second coordinate, concerning the columns of the matrix $B$. All links (and thus all passing data flows) are parallel to some axis, and directed towards ascending coordinates.
It is not always so clear how expressive space coordinates can be determined; we refer to the systolic array from Figure 16.3(a) as an example. But however the coordinate system is chosen, it is important that the regular structure of the systolic array is obviously reflected in the coordinates of the cells. Therefore, almost always integral coordinates are used. Moreover, the coordinates of cells with minimum Euclidean distance should differ in exactly one coordinate, and there by exactly 1.
Each active cell $(i,j)$ from Figure 16.1 computes exactly the element $c_{ij}$ of the result matrix $C$. Therefore, the cell must evaluate the dot product

$$c_{ij} = \sum_{k=1}^{N_3} a_{ik} \cdot b_{kj} .$$
This is done iteratively: in each step, a product $a_{ik} \cdot b_{kj}$ is calculated and added to the current partial sum for $c_{ij}$. Obviously, the partial sum has to be cleared, or set to another initial value if required, before starting the accumulation. Inspired by the classical notation of imperative programming languages, the general proceeding could be specified in pseudocode as follows:

MATRIX-PRODUCT(A, B)
  for i ← 1 to N1
    for j ← 1 to N2
      c(i, j) ← 0
      for k ← 1 to N3
        c(i, j) ← c(i, j) + a(i, k) · b(k, j)
  return c
We have to perform $N_1 N_2 N_3$ multiplications, additions, and assignments each. Hence the running time of this algorithm is of order $\Theta(N_1 N_2 N_3)$ for any sequential processor.
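For reference, the sequential procedure just described can be sketched as executable code. The following Python version (the name `matmul` is our own, not from the text) mirrors the triple loop and its $N_1 N_2 N_3$ multiply-add operations:

```python
def matmul(a, b):
    """Sequential matrix product with N1*N2*N3 multiply-add operations."""
    n1, n3 = len(a), len(a[0])
    n2 = len(b[0])
    assert len(b) == n3, "inner dimensions must agree"
    c = [[0] * n2 for _ in range(n1)]
    for i in range(n1):
        for j in range(n2):
            # clear the partial sum, then accumulate the dot product
            for k in range(n3):
                c[i][j] += a[i][k] * b[k][j]
    return c
```

A systolic array performs exactly these multiply-add operations, but distributes them over cells and timesteps instead of a single sequential loop nest.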
The sum operator $\sum$ is one of the so-called generic operators, which combine an arbitrary number of operands. In the systolic array from Figure 16.1, all additions contributing to a particular sum $c_{ij}$ are performed in the same cell. However, there are plenty of examples where the individual operations of a generic operator are spread over several cells; see, for instance, the systolic array from Figure 16.3.
Remark. Further examples of generic operators are: product, minimum, maximum, as well as the Boolean operators AND and OR.
Thus, generic operators usually have to be serialised before the calculations to perform can be assigned to the cells of the systolic array. Since the distribution of the individual operations to the cells is not unique, generic operators generally must be handled differently from simple operators of fixed arity, such as the dyadic addition.
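As a small illustration of serialisation, the following sketch (our own helper, not from the text) unrolls the generic sum operator into a chain of dyadic additions, recording every intermediate state; these intermediate states are exactly what the instances introduced below make explicit:

```python
def serialised_sum(xs):
    """Serialise the generic sum operator into dyadic additions.

    Returns the sequence of partial sums s_0, ..., s_n with s_0 = 0
    and s_k = s_{k-1} + x_k; each step is a fixed-arity addition that
    can be assigned to a cell and a timestep.
    """
    partials = [0]
    for x in xs:
        partials.append(partials[-1] + x)
    return partials
```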
Instead of using an imperative style as in the algorithm above, we better describe systolic programs by an assignment-free notation based on an equational calculus. Thus we avoid side effects and are able to express parallelism directly. For instance, we may be bothered about the reuse of the program variable $c(i,j)$ in the algorithm above. So, we replace $c(i,j)$ with a sequence of instances $c(i,j,k)$ that stand for the successive states of $c(i,j)$. This approach yields a so-called recurrence equation. We are now able to state the general matrix product by the following assignment-free expressions:

$$c(i,j,0) = 0 \qquad 1 \le i \le N_1,\ 1 \le j \le N_2$$
$$c(i,j,k) = c(i,j,k-1) + a_{ik} \cdot b_{kj} \qquad 1 \le i \le N_1,\ 1 \le j \le N_2,\ 1 \le k \le N_3 \qquad (16.1)$$
$$c_{ij} = c(i,j,N_3) \qquad 1 \le i \le N_1,\ 1 \le j \le N_2$$
System (16.1) explicitly describes the fine structure of the executed systolic algorithm. The first equation specifies all input data, the third equation all output data. The systolic array implements these equations by input/output operations. Only the second equation corresponds to real calculations.
Each equation of the system is accompanied, on the right side, by a quantification. The quantification states the set of values the iteration variables $i$ and $j$ (and, for the second equation, also $k$) should take. Such a set is called a domain. The iteration variables of the second equation can be combined in an iteration vector $(i,j,k)$. For the input/output equations, the iteration vector would consist of the components $i$ and $j$ only. To get a closed representation, we augment this vector by a third component $k$ that takes a fixed value: inputs then are characterised by $k=0$, outputs by $k=N_3+1$. Overall we get the following system:

$$c(i,j,k) = 0 \qquad 1 \le i \le N_1,\ 1 \le j \le N_2,\ k = 0$$
$$c(i,j,k) = c(i,j,k-1) + a_{ik} \cdot b_{kj} \qquad 1 \le i \le N_1,\ 1 \le j \le N_2,\ 1 \le k \le N_3 \qquad (16.2)$$
$$c_{ij} = c(i,j,k-1) \qquad 1 \le i \le N_1,\ 1 \le j \le N_2,\ k = N_3 + 1$$
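To make the role of the domains concrete, the following Python sketch (our own naming) evaluates the assignment-free system point by point, one atomic operation per iteration vector $(i,j,k)$; the matrices are passed as zero-based nested lists:

```python
def evaluate_system(a, b, n1, n2, n3):
    """Evaluate the matrix-product recurrence domain point by point.

    k = 0 corresponds to the input equation (a preloaded zero),
    1 <= k <= n3 to the multiply-add computation equation, and the
    final instance c(i, j, n3) to the output equation.
    """
    c = {}    # instances c(i, j, k), indexed by iteration vector
    out = {}  # output data, one element per cell (i, j)
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            c[i, j, 0] = 0                              # input equation
            for k in range(1, n3 + 1):                  # computation equation
                c[i, j, k] = c[i, j, k - 1] + a[i - 1][k - 1] * b[k - 1][j - 1]
            out[i, j] = c[i, j, n3]                     # output equation
    return out
```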
Note that although the domains for the input/output equations now are formally also of dimension 3, as a matter of fact they are only two-dimensional in the classical geometric sense.
From equations as in system (16.2), we can directly infer the atomic operations to perform in the cells of the systolic array. We find these operations by instantiating each equation of the system with all points of the respective domain. If an equation contains several suboperations corresponding to one point of the domain, these are seen as a compound operation, and are always processed together by the same cell in one working cycle.
In the second equation of system (16.2), for instance, we find the multiplication $a_{ik} \cdot b_{kj}$ and the successive addition $c(i,j,k-1) + a_{ik} \cdot b_{kj}$. The corresponding elementary operations, multiplication and addition, are indeed executed together as a multiply-add compound operation by the cell of the systolic array shown in Figure 16.1(b).
Now we can assign a designation, also called coordinates, to each elementary operation. A straightforward method to define suitable coordinates is provided by the iteration vectors used in the quantifications.
Applying this concept to system (16.1), we can for instance assign the tuple of coordinates $(i,j,k)$ to the calculation of the instance $c(i,j,k)$. The same tuple is assigned to the input operation $c(i,j,0)$, but with setting $k=0$. By the way: all domains are disjoint in this example.
If we always use the iteration vectors as designations for the calculations and the input/output operations, there is no further need to distinguish between coordinates and iteration vectors. Note, however, that this decision also mandates that all operations belonging to a certain point of the domain together constitute a compound operation—even when they appear in different equations and possibly are not related. For simplicity, we always use the iteration vectors as coordinates in the sequel.
The various elementary operations always happen in discrete timesteps in the systolic cells. All these timesteps driving a systolic array are of equal duration. Moreover, all cells of a systolic array work completely synchronously, i.e., they all start and finish their respective communication and calculation steps at the same time. Successive timesteps controlling a cell seamlessly follow each other.
Remark. But haven't we learned from Albert Einstein that strict simultaneity is physically impossible? Indeed, all we need here are cells that operate almost simultaneously. Technically this is guaranteed by providing all systolic cells with a common clock signal that switches all registers of the array. Within the bounds of the usually achievable accuracy, the communication between the cells happens sufficiently synchronised, and thus no loss of data occurs concerning send and receive operations. Therefore, it is justified to assume conceptual simultaneity for theoretical reasoning.
Now we can slice physical time into units of a timestep and number the timesteps consecutively. The origin of the time axis can be chosen arbitrarily, since time runs synchronously for all cells. A reasonable decision would be to let the first calculation in any cell happen at time $t=0$. Under this regime, the elementary compound operation of system (16.2) designated by $(i,j,k)$ would be executed at time $t = i+j+k-3$. On the other hand, it would be equally justified to assign the time $t = i+j+k$ to the coordinates $(i,j,k)$, because this change would only induce a global time shift by three time units.
So let us assume for the following that the execution of an instance $(i,j,k)$ starts at time $t = i+j+k$. The first calculation in our example then happens at time 3, the last at time $N_1+N_2+N_3$. The running time thus amounts to $N_1+N_2+N_3-2$ timesteps.
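Under the assumed time function $t(i,j,k) = i+j+k$, the schedule can be checked mechanically. This small sketch (our own helper, not from the text) enumerates the execution times of all multiply-add operations over the computation domain:

```python
def schedule(n1, n2, n3):
    """Return (first, last, running_time) under t(i, j, k) = i + j + k.

    The domain is 1 <= i <= n1, 1 <= j <= n2, 1 <= k <= n3;
    the running time is last - first + 1 = n1 + n2 + n3 - 2 timesteps.
    """
    times = [i + j + k
             for i in range(1, n1 + 1)
             for j in range(1, n2 + 1)
             for k in range(1, n3 + 1)]
    first, last = min(times), max(times)
    return first, last, last - first + 1
```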
Normally, the data needed for calculation by the systolic array are initially not located inside the cells of the array. Rather, they must be infused into the array from the outside world. The outside world in this case is a host computer, usually a scalar control processor accessing a central data storage. The control processor, at the right time, fetches the necessary data from the storage, passes them to the systolic array in a suitable way, and eventually writes the calculated results back into the storage.
Each cell $(i,j)$ must access the operands $a_{ik}$ and $b_{kj}$ during the timestep concerning index value $k$. But only the cells of the leftmost column of the systolic array from Figure 16.1 get the items of the matrix $A$ directly as input data from the outside world. All other cells must be provided with the required values from a neighbouring cell. This is done via the horizontal links between neighbouring cells, see Figure 16.1(a). The item $a_{ik}$ successively passes the cells $(i,1), (i,2), \ldots, (i,5)$. Correspondingly, the value $b_{kj}$ enters the array at cell $(1,j)$ and then flows through the vertical links, reaching the cells up to cell $(3,j)$. An arrowhead in the figure shows in which direction the link is oriented.
Frequently, it is considered problematic to transmit a value over large distances within a single timestep in a distributed or parallel architecture. Now suppose that, in our example, cell $(i,j)$ got the value $a_{ik}$ during timestep $t$ from cell $(i,j-1)$, or from the outside world. For the reasons described above, $a_{ik}$ is not passed from cell $(i,j)$ to cell $(i,j+1)$ in the same timestep $t$, but one timestep later, i.e., at time $t+1$. This also holds for the values $b_{kj}$. The delay is visualised in the detail drawing of the cell from Figure 16.1(b): input data flowing through a cell always pass one register, and each passed register induces a delay of exactly one timestep.
Remark. For systolic architectures, it is mandatory that any path between two cells contains at least one register—even when forwarding data to a neighbouring cell, only. All registers in the cells are synchronously switched by the global clock signal of the systolic array. This results in the characteristic rhythmical traffic on all links of the systolic array. Because of the analogy with pulsating veins, the medical term systole has been reused for the name of the concept.
To elucidate the delayed forwarding of values, we augment system (16.1) with further equations. Repeatedly used values like $a_{ik}$ and $b_{kj}$ are represented by separate instances, one for each access. The result of this proceeding, which is very characteristic for the design of systolic algorithms, is shown as system (16.3):

$$a(i,0,k) = a_{ik}, \qquad b(0,j,k) = b_{kj}, \qquad c(i,j,0) = 0$$
$$a(i,j,k) = a(i,j-1,k), \qquad b(i,j,k) = b(i-1,j,k) \qquad (16.3)$$
$$c(i,j,k) = c(i,j,k-1) + a(i,j-1,k) \cdot b(i-1,j,k)$$
$$c_{ij} = c(i,j,N_3)$$

each equation quantified over the appropriate domain within $1 \le i \le N_1$, $1 \le j \le N_2$, $1 \le k \le N_3$.
Each of the partial sums $c(i,j,k)$ in the progressive evaluation of $c_{ij}$ is calculated in a certain timestep and then used only once, namely in the next timestep. Therefore, cell $(i,j)$ must provide a register (named C in Figure 16.1(b)) where the value of $c(i,j,k)$ can be stored for one timestep. Once the old value is no longer needed, the register can be overwritten with the new value $c(i,j,k+1)$. When eventually the dot product is completed, the register contains the value $c(i,j,N_3)$, that is the final result $c_{ij}$. Before performing any computation, the register has to be cleared, i.e., preloaded with a zero value, or any other desired value.
In contrast, there is no need to store the values $a_{ik}$ and $b_{kj}$ permanently in cell $(i,j)$. As we can learn from Figure 16.1(a), each row of the matrix $A$ is delayed by one timestep with respect to the preceding row, and so are the columns of the matrix $B$. Thus the values $a_{ik}$ and $b_{kj}$ arrive at cell $(i,j)$ exactly when the calculation of $c(i,j,k)$ is due. They are put into the registers A and B, respectively, then immediately fetched from there for the multiplication, and in the same cycle forwarded to the neighbouring cells. The values $a_{ik}$ and $b_{kj}$ are of no further use for cell $(i,j)$ after they have been multiplied, and need not be stored there any longer. So A and B are overwritten with new values during the next timestep.
It should be obvious from this exposition that we urgently need to make economic use of the memory contained in a cell. Any calculation and any communication must be coordinated in space and time in such a way that the storing of values is limited to the shortest possible time interval. This goal can be achieved by immediately using and forwarding the received values. Besides the overall structure of the systolic array, choosing an appropriate input/output scheme and placing the corresponding number of delays in the cells essentially facilitates the desired coordination. Figure 16.1(b) in this respect shows the smallest possible delay of one timestep.
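The complete interplay of skewed input, register delays, and stationary accumulation can be mimicked by a cycle-accurate simulation. The following Python sketch is our own model, not from the text: it uses zero-based indexing and its own time counter, feeds zero-padded, skewed streams at the left and upper borders, and latches each A and B register from the neighbouring cell with a delay of one timestep, as in Figure 16.1:

```python
def systolic_matmul(a, b):
    """Cycle-accurate sketch of the rectangular systolic array."""
    n1, n3, n2 = len(a), len(a[0]), len(b[0])
    A = [[0] * n2 for _ in range(n1)]   # register A of each cell
    B = [[0] * n2 for _ in range(n1)]   # register B of each cell
    C = [[0] * n2 for _ in range(n1)]   # stationary accumulator C
    for t in range(1, n1 + n2 + n3 - 1):          # N1 + N2 + N3 - 2 timesteps
        # latch registers: each cell takes the value its left/upper
        # neighbour held during the previous timestep (one-step delay)
        for i in range(n1):
            for j in range(n2 - 1, 0, -1):
                A[i][j] = A[i][j - 1]
        for j in range(n2):
            for i in range(n1 - 1, 0, -1):
                B[i][j] = B[i - 1][j]
        # boundary cells receive the skewed, zero-padded input streams
        for i in range(n1):
            k = t - 1 - i                # row i of A is delayed by i timesteps
            A[i][0] = a[i][k] if 0 <= k < n3 else 0
        for j in range(n2):
            k = t - 1 - j                # column j of B is delayed by j timesteps
            B[0][j] = b[k][j] if 0 <= k < n3 else 0
        # every cell performs its multiply-add compound operation
        for i in range(n1):
            for j in range(n2):
                C[i][j] += A[i][j] * B[i][j]
    return C
```

Note that the simulation runs for exactly $N_1+N_2+N_3-2$ timesteps, and that the zero padding makes the idle multiply-adds act as null operations.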
Geometrically, the input scheme of the example resulted from skewing the matrices $A$ and $B$. Thereby some places in the input streams became vacant and had to be filled with zero values; otherwise, the calculation of the $c_{ij}$ would have been garbled. The length of the input streams depends on the problem parameter $N_3$.
As can be seen in Figure 16.1, the items of the result matrix $C$ are calculated in a stationary way, i.e., all additions contributing to an item $c_{ij}$ happen in the same cell. Stationary variables do not move at all during the calculation in the systolic array. Stationary results eventually must be forwarded to a border of the array in a supplementary action to get delivered to the outside world. Moreover, it is necessary to initialise the register for each item $c_{ij}$. Performing these extra tasks requires a considerable expenditure of runtime and hardware. We will further study this problem in Section 16.4.
The characteristic operating style, with globally synchronised discrete timesteps of equal duration and the strict separation in time of the cells by registers, suggests that systolic arrays are special cases of pipelined systems. Here, the registers of the cells correspond to the well-known pipeline registers. However, classical pipelines are linear structures only, whereas systolic arrays frequently extend into more spatial dimensions, as visible in our example. A multi-dimensional systolic array can be regarded, with some justification, as a set of interconnected linear pipelines. Hence it should be apparent that basic properties of one-dimensional pipelining also apply to multi-dimensional systolic arrays.
A typical effect of pipelining is reduced utilisation at startup and during shutdown of the operation. Initially, the pipe is empty, with no pipeline stage active. Then, the first stage receives data and starts working; all other stages are still idle. During the next timestep, the first stage passes data to the second stage and itself receives new data; only these two stages do some work. More and more stages become active until all stages process data in every timestep; the pipeline is now fully utilised for the first time. After a series of timesteps at maximum load, with duration dependent on the length of the data stream, the input sequence ceases, and the first stage of the pipeline runs out of work. In the next timestep, the second stage stops working, too. And so on, until eventually all stages have fallen idle again. Phases of reduced activity diminish the average performance of the whole pipeline, and this drop in productivity is relatively the worse, the more stages the pipeline has in relation to the length of the data stream.
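The effect can be quantified for our array: each cell performs genuine work during only $N_3$ of the $N_1+N_2+N_3-2$ timesteps of the whole computation. A small sketch (our own helper, not from the text) computes the resulting average utilisation:

```python
def utilisation(n1, n2, n3):
    """Average cell utilisation of the rectangular systolic array.

    Each of the n1 * n2 cells performs n3 genuine multiply-add steps,
    while the whole computation spans n1 + n2 + n3 - 2 timesteps.
    """
    useful = n1 * n2 * n3
    total = n1 * n2 * (n1 + n2 + n3 - 2)
    return useful / total
```

For instance, with $N_1=3$, $N_2=5$, $N_3=4$ the utilisation is only 40 per cent; it approaches 1 only when $N_3$ grows large relative to the array dimensions, matching the observation that long data streams amortise the startup and shutdown phases.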
We now study this phenomenon in some depth by analysing the two-dimensional systolic array from Figure 16.1. As expected, we find a lot of idling cells when starting or finishing the calculation. In the first timestep, only cell (1,1) performs some useful work; all other cells in fact do calculations that work like null operations, and that is what they are supposed to do in this phase. In the second timestep, cells (1,2) and (2,1) also come to real work, see Figure 16.2(a). Data floods the array until eventually all cells are doing work. After the last true data item has left cell (1,1), the latter no longer contributes to the calculation but merely reproduces the finished value of $c_{11}$. Step by step, more and more cells drop off. Finally, only cell (3,5) makes a last necessary computation step; Figure 16.2(b) shows this concluding timestep.
Figure 16.2. Two snapshots for the systolic array from Figure 16.1.
16.1-1 What must be changed in the input scheme from Figure 16.1(a) to multiply an $M_1 \times M_3$ matrix by an $M_3 \times M_2$ matrix, with $M_1 \le 3$ and $M_2 \le 5$, on the same systolic array? Could the calculations be organised such that the result matrix would emerge in the lower right corner of the systolic array?
16.1-2 Why is it necessary to clear the spare slots in one set of input streams, as shown in Figure 16.1? Why haven't we done the same for the other matrix also?
16.1-3 If the systolic array from Figure 16.1 should be interpreted as a pipeline: how many stages would you suggest to adequately describe the behaviour?