15.7. 15.7 Mesh algorithms

To illustrate another model of computation we present two algorithms solving the prefix problem on meshes.

15.7.1. 15.7.1 Prefix on chain

Let suppose that processor of the chain stores element in its local memory, and after the parallel computations the prefix will be stored in the local memory of . At first we introduce a naive algorithm. Its input is the sequence of elements , and its output is the sequence , containing the prefixes.

Chain-Prefix( )

  1   sends  to    2   
                     
                        IN PARALLEL FOR
                      
                      
                     
                        TO
                      
                        3  
                        FOR
                      
                      
                     
                        TO
                      
                        4    
                        DO
                      gets  from , then computes and stores               stores , and sends  to   5   gets  from , then computes and stores  
                  

Saying the truth, this is not a real parallel algorithm.

Theorem 15.18 Algorithm Chain-Prefix determines the prefixes of p elements using a chain in time.

Proof. The cycle in lines 2–5 requires time, line 1 and line 6 requires time.

Since the prefixes can be determined in time using a sequential processor, and , so CHAIN-Prefix is not work-effective.

15.7.2. 15.7.2 Prefix on square

An algorithm, similar to Chain-Prefix , can be developed for a square too.

Let us consider a square of size . We need an indexing of the processors. There are many different indexing schemes, but for the next algorithm Square-Prefix sufficient is the one of the simplest solutions, the row-major indexing scheme, where processor gets the index .

The input and the output are the same, as in the case of Chain-Prefix .

The processors form the processor row and the processors form the processor column . The input stored by the processors of row is denoted by , and the similar output is denoted by .

The algorithm works in 3 rounds. In the first round (lines 1–8) processor rows compute the row-local prefixes (working as processors of Chain-Prefix ). In the second round (lines 9–17) the column computes the prefixes using the results of the first round, and the processors of this column send the computed prefix to the neighbour . Finally in the third round the rows determine the final prefixes.

Square-Prefix( )

  1   
                     
                        IN PARALLEL FOR
                      
                      
                     
                        TO
                      
                        2    
                        DO
                      sends  to    3   
                     
                        IN PARALLEL FOR
                      
                      
                     
                        TO
                      
                        4    
                        FOR
                      
                      
                     
                        TO
                      
                        5       
                        DO
                      gets  from , then computes and   6          stores , and sends  to    7   
                     
                        IN PARALLEL FOR
                      
                      
                     
                        TO
                      
                        8    
                        DO
                      gets  from , then computes and stores    9   sends  to   10   
                     
                        IN PARALLEL FOR
                      
                      
                     
                        TO
                      
                       11    
                        FOR
                      
                      
                     
                        TO
                      
                       12       
                        DO
                      gets  from , then computes and stores                 stores , and sends  to  13   gets  from , then computes and stores   14   
                     
                        IN PARALLEL FOR
                      
                      
                     
                        TO
                      
                       15    
                        DO
                      send  to   16   
                     
                        IN PARALLEL FOR
                      
                      
                     
                        TO
                      
                       17    
                        DO
                      sends  to   18   
                     
                        IN PARALLEL FOR
                      
                      
                     
                        DOWNTO
                      
                       19    
                        FOR
                      
                      
                     
                        TO
                      
                       20       
                        DO
                      gets  from , then computes and  21          stores , and sends  to   22   
                     
                        IN PARALLEL FOR
                      
                      
                     
                        TO
                      
                       23    
                        DO
                      gets  from , then computes and stores  
                  

Theorem 15.19 Algorithm Square-Prefix solves the prefix problem using a square of size , major row indexing in time.

Proof. In the first round lines 1–2 contain 1 parallel operation, lines 3–6 require operations, and line 8 again 1 operation, that is all together operations. In a similar way in the third round lines 18–23 require time units, and in round 2 lines 9–17 require time units. The sum of the necessary time units is .

Example 15.8 Prefix computation on square of size Figure 15.23(a) shows 16 input elements. In the first round Square-Prefix computes the row-local prefixes, part (b) of the figure show the results. Then in the second round only the processors of the fourth column work, and determine the column-local prefixes – results are in part (c) of the figure. Finally in the third round algorithm determines the final results shown in part (d) of the figure.

Figure 15.23.  Prefix computation on square.

Prefix computation on square.

  CHAPTER NOTES  

Basic sources of this chapter are for architectures and models the book of Leopold [ 221 ], and the book of Sima, Fountaine and Kacsuk [ 304 ], for parallel programming the book due to Kumar et al. [ 141 ] and [ 221 ], for parallel algorithms the books of Berman and Paul, [ 41 ] Cormen, Leiserson and Rivest [ 72 ], the book written by Horowitz, Sahni and Rajasekaran [ 167 ] and the book [ 176 ], and the recent book due to Casanova, Legrand and Robert [ 58 ].

The website [ 324 ] contains the Top 500 list, a regularly updated survey of the most powerful computers worldwide [ 324 ]. It contains 42% clusters.

Described classifications of computers are proposed by Flynn [ 113 ], and Leopold [ 221 ]. The Figures 15.1, 15.2, 15.3, 15.4, 15.5, 15.7 are taken from the book of Leopold [ 221 ], the program 15.6 from the book written by Gropp et al. [ 145 ].

The clusters are characterised using the book of Pfister [ 273 ], grids are presented on the base of the book and manuscript of Foster and Kellerman [ 117 ], [ 118 ].

With the problems of shared memory deal the book written by Hwang and Xu [ 172 ], the book due to Kleiman, Shah, and Smaalders [ 199 ], and the textbook of Tanenbaum and van Steen [ 315 ].

Details on concepts as tasks, processes and threads can be found in many textbook, e.g. in [ 303 ], [ 314 ]. Decomposition of the tasks into smaller parts is analysed by Tanenbaum and van Steen [ 315 ].

The laws concerning the speedup were described by Amdahl [ 15 ], Gustafson-Barsis [ 150 ] and Brent [ 47 ]. Kandemir, Ramanujam and Choudray review the different methods of the improvement of locality [ 188 ]. Wolfe [ 346 ] analyses in details the connection between the transformation of the data and the program code. In connection with code optimisation the book published by Kennedy and Allen [ 197 ] is a useful source.

The MPI programming model is presented according to Gropp, Snir, Nitzberg, and Lusk [ 145 ], while the base of the description of the OpenMP model is the paper due to Chandra, Dragum, Kohr, Dror, McDonald and Menon [ 60 ], further a review found on the internet [ 258 ].

Lewis and Berg [ 222 ] discuss pthreads, while Oaks and Wong [ 257 ] the Java threads in details. Description of High Performance Fortran can be found in the book Koelbel et al. [ 204 ]. Among others Wolfe [ 346 ] studied the parallelising compilers.

The concept of PRAM is due to Fortune and Wyllie and is known since 1978 [ 116 ]. BSP was proposed in 1990 by Valiant [ 334 ]. LogP has been suggested as an alternative of BSP by Culler et al. in 1993 [ 78 ]. QSM was introduced in 1999 by Gibbons, Matias and Ramachandran [ 132 ].

The majority of the pseudocode conventions used in Section 15.6 and the description of crossover points and comparison of different methods of matrix multiplication can be found in [ 72 ].

The Readers interested in further programming models, as skeletons, parallel functional programming, languages of coordination and parallel mobile agents, can find a detailed description in [ 221 ]. Further problems and parallel algorithms are analysed in the books of Leighton [ 218 ], [ 219 ] and in the chapter Memory Management of this book [ 28 ] and in the book of Horowitz, Sahni and Rajasekaran [ 167 ]. A model of scheduling of parallel processes is discussed in [ 130 ], [ 174 ], [ 345 ].

Cost-optimal parallel merge is analysed by Wu and Olariu in [ 349 ]. New ideas (as the application of multiple comparisons to get a constant time sorting algoritm) of parallel sorting can be found in the paper of Gararch, Golub and Kruskal [ 126 ].