To illustrate another model of computation we present two algorithms solving the prefix problem on meshes.

Let us suppose that processor P_i of the chain stores element x_i in its local memory, and that after the parallel computation the prefix y_i = x_1 ⊕ x_2 ⊕ ... ⊕ x_i is stored in the local memory of P_i. At first we introduce a naive algorithm. Its input is the sequence of elements x_1, x_2, ..., x_p, and its output is the sequence y_1, y_2, ..., y_p containing the prefixes.
ChainPrefix(X, Y)
 1  P_1 computes and stores y_1 = x_1, then sends y_1 to P_2
 2  FOR i ← 2 TO p − 1
 3     DO P_i gets y_{i−1} from P_{i−1},
 4        then computes and stores y_i = y_{i−1} ⊕ x_i,
 5        and sends y_i to P_{i+1}
 6  P_p gets y_{p−1} from P_{p−1}, then computes and stores y_p = y_{p−1} ⊕ x_p
To tell the truth, this is not a real parallel algorithm: at any moment only one processor of the chain is working.
Theorem 15.18 Algorithm ChainPrefix determines the prefixes of p elements using a chain in Θ(p) time.

Proof. The cycle in lines 2–5 requires p − 2 time units, and line 1 and line 6 require 1 time unit each, so the algorithm runs in p = Θ(p) time.

Since the prefixes can be determined in Θ(p) time using a single sequential processor, while the work of the algorithm is p · Θ(p) = Θ(p²), ChainPrefix is not work-effective.
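The work comparison can be made concrete with a few lines of code (a rough sketch of ours; "work" is taken here as the number of processors times the parallel running time):

```python
def work(processors, parallel_time):
    """Total work of a parallel algorithm: processors x parallel time."""
    return processors * parallel_time

p = 1024
sequential_time = p   # one processor computes the prefixes in Theta(p) steps
chain_time = p        # ChainPrefix also needs Theta(p) steps, but uses p processors

print(work(1, sequential_time))  # 1024: work of the sequential algorithm
print(work(p, chain_time))       # 1048576 = p**2: ChainPrefix is not work-effective
```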
An algorithm similar to ChainPrefix can be developed for a square too. Let us consider a square of size √p × √p. We need an indexing of the processors. There are many different indexing schemes, but for the next algorithm SquarePrefix one of the simplest solutions is sufficient: the row-major indexing scheme, in which processor P_{i,j} gets the index (i − 1)√p + j.
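For instance, the row-major index can be computed as follows (a small illustration of ours, using 1-based row and column numbers as in the text):

```python
import math

def row_major_index(i, j, p):
    """Row-major index of processor P_{i,j} in a sqrt(p) x sqrt(p) square.

    i and j are 1-based; p is assumed to be a perfect square.
    """
    n = math.isqrt(p)        # side length sqrt(p)
    return (i - 1) * n + j

print(row_major_index(1, 1, 16))  # 1
print(row_major_index(2, 3, 16))  # 7
print(row_major_index(4, 4, 16))  # 16
```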
The input and the output are the same as in the case of ChainPrefix. The processors P_{i,1}, P_{i,2}, ..., P_{i,√p} form processor row i, and the processors P_{1,j}, P_{2,j}, ..., P_{√p,j} form processor column j. The input stored by the processors of row i is denoted by X_i, and the corresponding output is denoted by Y_i.

The algorithm works in 3 rounds. In the first round (lines 1–8) the processor rows compute the row-local prefixes (working as the processors of ChainPrefix). In the second round (lines 9–17) column √p computes the prefixes of the row totals using the results of the first round, and the processors of this column send the computed prefixes to their neighbours in column √p − 1. Finally, in the third round (lines 18–23) the rows determine the final prefixes.
SquarePrefix(X, Y)
 1  IN PARALLEL FOR i ← 1 TO √p
 2     DO P_{i,1} computes and stores y_{i,1} = x_{i,1}, then sends y_{i,1} to P_{i,2}
 3  IN PARALLEL FOR i ← 1 TO √p
 4     FOR j ← 2 TO √p − 1
 5        DO P_{i,j} gets y_{i,j−1} from P_{i,j−1}, then computes and
 6           stores y_{i,j} = y_{i,j−1} ⊕ x_{i,j}, and sends y_{i,j} to P_{i,j+1}
 7  IN PARALLEL FOR i ← 1 TO √p
 8     DO P_{i,√p} gets y_{i,√p−1} from P_{i,√p−1}, then computes and stores y_{i,√p} = y_{i,√p−1} ⊕ x_{i,√p}
 9  P_{1,√p} sends y_{1,√p} to P_{2,√p}
10  FOR i ← 2 TO √p − 1
11     DO P_{i,√p} gets y_{i−1,√p} from P_{i−1,√p}, then computes and
12        stores y_{i,√p} = y_{i−1,√p} ⊕ y_{i,√p}, and sends y_{i,√p} to P_{i+1,√p}
13  P_{√p,√p} gets y_{√p−1,√p} from P_{√p−1,√p}, then computes and stores y_{√p,√p} = y_{√p−1,√p} ⊕ y_{√p,√p}
14  IN PARALLEL FOR i ← 1 TO √p − 1
15     DO P_{i,√p} sends y_{i,√p} to P_{i+1,√p}
16  IN PARALLEL FOR i ← 2 TO √p
17     DO P_{i,√p} sends y_{i−1,√p} to P_{i,√p−1}
18  IN PARALLEL FOR i ← 2 TO √p
19     FOR j ← √p − 1 DOWNTO 2
20        DO P_{i,j} gets y_{i−1,√p} from P_{i,j+1}, then computes and
21           stores y_{i,j} = y_{i−1,√p} ⊕ y_{i,j}, and sends y_{i−1,√p} to P_{i,j−1}
22  IN PARALLEL FOR i ← 2 TO √p
23     DO P_{i,1} gets y_{i−1,√p} from P_{i,2}, then computes and stores y_{i,1} = y_{i−1,√p} ⊕ y_{i,1}
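The three rounds can again be simulated sequentially. The sketch below is our own model of the algorithm, with ⊕ instantiated as addition and the p elements given row by row in row-major order:

```python
import math

def square_prefix(x, op=lambda a, b: a + b):
    """Simulate the three rounds of SquarePrefix.

    x is a list of p elements, p a perfect square, stored row by row.
    Returns the list of prefixes in the same row-major order.
    """
    p = len(x)
    n = math.isqrt(p)
    y = [x[i * n:(i + 1) * n] for i in range(n)]
    # Round 1: every row computes its row-local prefixes in parallel.
    for i in range(n):
        for j in range(1, n):
            y[i][j] = op(y[i][j - 1], y[i][j])
    # Round 2: the last column computes the prefixes of the row totals,
    # so y[i][n-1] becomes the final global prefix at that position.
    for i in range(1, n):
        y[i][n - 1] = op(y[i - 1][n - 1], y[i][n - 1])
    # Round 3: each row i > 0 receives the total of the preceding rows
    # (held by the last column of row i-1) and combines it, right to left,
    # with its row-local prefixes.
    for i in range(1, n):
        carry = y[i - 1][n - 1]
        for j in range(n - 1):
            y[i][j] = op(carry, y[i][j])
    return [v for row in y for v in row]

print(square_prefix(list(range(1, 17))))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136]
```

The printed values agree with the sequential prefix sums of 1, 2, ..., 16.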
Theorem 15.19 Algorithm SquarePrefix solves the prefix problem using a square of size √p × √p with row-major indexing in Θ(√p) time.

Proof. In the first round lines 1–2 contain 1 parallel operation, lines 3–6 require √p − 2 operations, and lines 7–8 again 1 operation, that is, altogether √p operations. In a similar way, in the third round lines 18–23 require √p − 1 time units, and in round 2 lines 9–17 require √p + 2 time units. The sum of the necessary time units is 3√p + 1 = Θ(√p).
Example 15.8 Prefix computation on a square of size 4 × 4.

Figure 15.23(a) shows the 16 input elements. In the first round SquarePrefix computes the row-local prefixes; part (b) of the figure shows the results. Then in the second round only the processors of the fourth column work and determine the column-local prefixes; the results are in part (c) of the figure. Finally, in the third round the algorithm determines the final results, shown in part (d) of the figure.
CHAPTER NOTES

Basic sources of this chapter are, for architectures and models, the book of Leopold [ 221 ] and the book of Sima, Fountain and Kacsuk [ 304 ]; for parallel programming, the book due to Kumar et al. [ 141 ] and [ 221 ]; for parallel algorithms, the books of Berman and Paul [ 41 ] and of Cormen, Leiserson and Rivest [ 72 ], the book written by Horowitz, Sahni and Rajasekaran [ 167 ] and the book [ 176 ], and the recent book due to Casanova, Legrand and Robert [ 58 ].
The website [ 324 ] contains the Top 500 list, a regularly updated survey of the most powerful computers worldwide; at the time of writing, 42% of the listed machines are clusters.
The classifications of computers described here were proposed by Flynn [ 113 ] and Leopold [ 221 ]. Figures 15.1, 15.2, 15.3, 15.4, 15.5 and 15.7 are taken from the book of Leopold [ 221 ], and program 15.6 from the book written by Gropp et al. [ 145 ].
The clusters are characterised using the book of Pfister [ 273 ], and grids are presented on the basis of the book and the manuscript of Foster and Kesselman [ 117 ], [ 118 ].
The problems of shared memory are dealt with in the book written by Hwang and Xu [ 172 ], the book due to Kleiman, Shah and Smaalders [ 199 ], and the textbook of Tanenbaum and van Steen [ 315 ].
Details on concepts such as tasks, processes and threads can be found in many textbooks, e.g. in [ 303 ], [ 314 ]. Decomposition of tasks into smaller parts is analysed by Tanenbaum and van Steen [ 315 ].
The laws concerning speedup were described by Amdahl [ 15 ], Gustafson and Barsis [ 150 ], and Brent [ 47 ]. Kandemir, Ramanujam and Choudhary review the different methods of improving locality [ 188 ]. Wolfe [ 346 ] analyses in detail the connection between the transformation of the data and the program code. In connection with code optimisation, the book published by Kennedy and Allen [ 197 ] is a useful source.
The MPI programming model is presented according to Gropp, Snir, Nitzberg and Lusk [ 145 ], while the basis of the description of the OpenMP model is the paper due to Chandra, Dagum, Kohr, Maydan, McDonald and Menon [ 60 ], as well as a review found on the internet [ 258 ].
Lewis and Berg [ 222 ] discuss pthreads, while Oaks and Wong [ 257 ] discuss Java threads in detail. A description of High Performance Fortran can be found in the book of Koelbel et al. [ 204 ]. Among others, Wolfe [ 346 ] studied parallelising compilers.
The concept of the PRAM is due to Fortune and Wyllie and has been known since 1978 [ 116 ]. BSP was proposed in 1990 by Valiant [ 334 ]. LogP was suggested as an alternative to BSP by Culler et al. in 1993 [ 78 ]. QSM was introduced in 1999 by Gibbons, Matias and Ramachandran [ 132 ].
The majority of the pseudocode conventions used in Section 15.6 and the description of crossover points and comparison of different methods of matrix multiplication can be found in [ 72 ].
Readers interested in further programming models, such as skeletons, parallel functional programming, coordination languages and parallel mobile agents, can find a detailed description in [ 221 ]. Further problems and parallel algorithms are analysed in the books of Leighton [ 218 ], [ 219 ], in the chapter Memory Management of this book [ 28 ], and in the book of Horowitz, Sahni and Rajasekaran [ 167 ]. Models of scheduling of parallel processes are discussed in [ 130 ], [ 174 ], [ 345 ].
Cost-optimal parallel merge is analysed by Wu and Olariu in [ 349 ]. New ideas of parallel sorting (such as the application of multiple comparisons to get a constant-time sorting algorithm) can be found in the paper of Gasarch, Golub and Kruskal [ 126 ].