Computational problems are based on mathematical models and their numerical solution. The numerical solution methods practically always rely on some kind of linearization, resulting in algorithms that require us to solve linear systems of equations and perform matrix-vector multiplication as a core of the iterative solution. Thus, matrix-vector multiplication is a basic operation that can be, if implemented efficiently on the parallel architecture, the most general building block in any numerical algorithm. We define the basic problem to be the computation of the result vector from input matrix , vectors and , as

We call this the *MV* problem. Let be the dimensions of matrix . As every input vector element may contribute to each of the output vector elements, a scalar CPU implementation would contain a double loop, one loop scans the input elements while the other the output elements. If we parallelize the algorithm by assigning output elements to parallel threads, then we obtain a gathering type algorithm where a thread gathers the contributions of all input elements and aggregates them to the thread's single output value. On the other hand, if we assigned parallel threads to input elements, then a thread would compute the contribution of this input element to all output elements, which would be a scatter operation. In case of gathering, threads share only input data but their output is exclusive so no synchronization is needed. In case of scattering, multiple threads may add their contribution to the same output element, so atomic adds are needed, which may result in performance degradation.

An implementation of the matrix-vector multiplication on a scalar processor looks like the following:

void ScalarMV(int N, int M, float* y, const float* A, const float* x, const float* b) { for(int i=0; i≤N; i++) { float yi = b[i]; for(int j=0; j≤M; j++) yi += A[i * M + j] * x[j]; y[i] = yi; } }

The first step of porting this algorithm to a parallel machine is to determine what a single thread would do from this program. From the options of gathering and scattering, we should prefer gathering since that automatically eliminates the problems of non-exclusive write operations. In a gathering type solution, a thread computes a single element of vector and thus we need to start threads. A GPU can launch a practically unlimited number of threads that are grouped in thread blocks. Threads of a block are assigned to the same multiprocessor. So the next design decision is how the threads are distributed in blocks. A multiprocessor typically executes 32 threads in parallel, so the number of threads in a block should be some multiple of 32. When the threads are halted because of a slow memory access, a hardware scheduler tries to continue the processing of other threads, so it is wise to assign more than 32 threads to a multiprocessor to always have threads that are ready to run. However, increasing the number of threads in a single block may also mean that at the end we have just a few blocks, i.e. our program will run just on a few multiprocessors. Considering these, we assign 256 threads to a single block and hope that exceeds the number of multiprocessors and thus we fully utilize the parallel hardware.

There is a slight problem if is not a multiple of . We should assign the last elements of the vector to some processors as well, so the thread block number should be the ceiling of . As a result of this, we shall have threads that are not associated with vector elements. It is not a problem if the extra threads can detect it and cause no harm, e.g. they do not over-index the output array.

Similarly to the discussed vector processing model, a thread should be aware which output element it is computing. The CUDA library provides implicit input parameters that encode this information: *blockIdx* is the index of the thread block, *blockDim* is the number of threads in a block, and *threadIdx* is the index of the thread inside the block.

The program of the CUDA kernel computing a single element of the output vector is now a part of a conventional CPU program:

__global__ void cudaSimpleMV(int N, int M, float* y, float* A, float* x, float* b) { // Determine element to process from thread and block indices int i = blockIdx.x * blockDim.x + threadIdx.x; if(i ≤ N) { // if the index is out of the range of the output array, skip. float yi = b[i]; for(int j=0; j≤M; j++) yi += A[i * M + j] * x[j]; y[i] = yi; } }

The **global** keyword tells the compiler that this function will run not on the CPU but on the GPU and it may be invoked from the CPU as well. The parameters are passed according to the normal C syntax. The only special feature is the use of the implicit parameters to compute the identification number of this thread, which is the index of the output array.

The kernels are started from a CPU program that sets the parameters and also defines the number of thread blocks and the number of threads inside a block.

__host__ void run_cudaSimpleMV() { int threadsPerBlock = 256; // number of threads per block int blockNum = (N + threadsPerBlock - 1)/threadsPerBlock; // number of blocks cudaSimpleMV≤≤≤blockNum, threadsPerBlock≥≥≥(N, M, y, A, x, b); }

The compiler will realize that this function runs on the CPU by reading the **host** keyword. The parallel threads are started like a normal C function call with the exception of the *≤blockNum, threadsPerBlock≥* tag, which defines how many threads should be started and how they are distributed among the multiprocessors.

So far, we assigned matrix rows to parallel threads and computed scalar product serially inside threads. If the number of matrix rows is less than the number of parallel scalar processors, this amount of parallelization is not enough to supply all processing units with work to do, and the execution of individual threads will be lengthy. Reformulating the scalar product computation is a well known, but tougher parallelization problem, as the additions cannot be executed independently, and we require a single scalar to be written for every row of the matrix. However, parts of the summation can be executed independently, and then the results added. This is a classic example of reduction. It is required that the threads whose results are to be added both finish execution and write their results to where they are accessible for the thread that needs to add them. Thus, we use **thread synchronization** and shared memory available only for the threads of the same block.

Let us assume first—unrealistically—that we can have threads processing a row and the shared memory can hold floating point values. Let be the vector of length residing in shared memory. Then, every thread can compute one element as . Finally, elements of must be reduced by summation. Let us further assume that . The reduction can be carried out in steps, terminating half of the threads, while each surviving thread adds the value in computed by a terminated one to its own. The final remaining thread outputs the value to the global memory.

#define M THE_NUMBER_OF_MATRIX_COLUMNS __global__ void cudaReduceMV(int N, float* y, float* A, float* x, float* b) { int i = blockIdx.x; int j = threadIdx.x; __shared__ float Q[M]; // in the shader memory inside a multiprocessor Q[j] = A[i * M + j] * x[j]; // a parallel part of matrix-vector multiplication for(int stride = M / 2; stride ≥ 0; stride ≥≥= 1) // reduction { __syncthreads(); // wait until all other threads of the block arrive this point if(j + stride ≤ M) Q[j] += Q[j + stride]; } if(j == 0) // reduced to a single element y[i] = Q[0] + b[i]; } __host__ void run_cudaReduceMV() { cudaReduceMV≤≤≤ N, M ≥≥≥(N, y, A, x, b); }

For practical matrix dimensions (), neither the number of possible threads of a single multiprocessor nor the size of the shared memory is enough to process all elements in parallel. In our next example, we use a single block of threads with limited size to process a large matrix. First, we break the output vector into segments of size . Elements within such a segment are evaluated in parallel, then the threads proceed to the next segment. Second, for every scalar product computation, we break the vectors and into segments of length . We maintain a shared vector of length for every row being processed in parallel. We can compute the elementwise product of the and segments in parallel, and add it to . As rows are being processed by threads each, the block will consist of threads. From one thread's perspective this means it has to loop over with a stride of , and for every such element in , loop over and with a stride of . Also for every element in , the contents of must be summed by reduction as before. The complete kernel which works with large matrices would then be:

__global__ void cudaLargeMV(int N, int M, float* y, float* A, float* x, float* b) { __shared__ float Q[T * Z]; // stored in the shared memory inside a multiprocessor int t = threadIdx.x / Z; int z = threadIdx.x % Z; for(int i = t; i ≤ N; i += T) { Q[t * Z + z] = 0; for(int j = z; j ≤ M; j += Z) Q[t * Z + z] += A[i * M + j] * x[j]; for(int stride = Z / 2; stride ≥ 0; stride ≥≥= 1) { __syncthreads(); if(z + stride ≤ Z) Q[t * Z + z] += Q[t * Z + z + stride]; } if(z == 0) y[i] = Q[t * Z + 0] + b[i]; } } __host__ void run_cudaLargeMV() { cudaReduceMV≤≤≤ 1, T*Z ≥≥≥(N, M, y, A, x, b); }

This can easily be extended to make use of multiple thread blocks by restricting the outer loop to only a fraction of the matrix rows based on the *blockIdx* parameter.

The above algorithm uses shared memory straightforwardly and allows us to align memory access of threads through a proper choice of block sizes. However, every element of vector must be read once for the computation of every row. We can improve on this if we read values of into the shared memory and have threads in one block operate on multiple rows of the matrix. This, however, means we can use less shared memory per line to parallelize summation. The analysis of this trade-off is beyond the scope of this chapter, but a block size of has been proposed in [ 88 ]. With such a strategy it is also beneficial to access matrix as a texture, as data access will exhibit 2D locality, supported by texture caching hardware.

Even though matrix-vector multiplication is a general mathematical formulation for a wide range of computational problems, the arising matrices are often large, but sparse. In case of sparse matrices, the previously introduced matrix-vector multiplication algorithms will not be efficient as they explicitly compute multiplication with zero elements. Sparse matrix representations and MV algorithms are discussed in [ 26 ].

**Exercises**

28.6-1 Implement matrix-vector multiplication for large matrices in CUDA. Compare results to a CPU implementation.

28.6-2 Implement an inverse iteration type Julia set renderer. The Julia set is the attractor of the iteration where and are complex numbers. Inverse iteration starts from a fixed point of the iteration formula, and iterates the inverse mapping, by randomly selecting either or from the two possibilities. Threads must use pseudo-random generators that are initialized with different seeds. Note that CUDA has no built-in random number generator, so implement one in the program.