Imagining the GPU as a vector processor is a simple but efficient application of the GPU hardware in general parallel processing. If the algorithm is suitable for vector processing, then this approach is straightforward. However, some algorithms are not good candidates for vector processing, but can still be efficiently executed by the GPU. In this section, we review the basic approaches that can extend the vector processing framework to make the GPU applicable for a wider range of algorithms.

Vector processors are usually SIMD machines, which means that they execute not only the same program for different vector elements but always the very same machine instruction at a time. It means that vector operations cannot contain data dependent conditionals or loop lengths depending on the actual data. There is only one control sequence in a SIMD parallel program.

Of course, writing programs without if conditionals and using only constants as loop cycle numbers are severe limitations, which significantly affects the program structure and the ease of development. Early GPU shaders were also SIMD type processors and placed the burden of eliminating all conditionals from the program on the shoulder of the programmer. Current GPUs and their compilers solve this problem automatically, thus, on the programming level, we can use conditionals and variable length loops as if the shaders were MIMD computers. On execution level, additional control logic makes it possible that execution paths of scalar units diverge: in this case it is still a single instruction which is executed at a time, possibly with some scalar units being idle. Operations of different control paths are serialized so that all of them are completed. The overhead of serialization makes performance strongly dependent on the coherence of execution paths, but many transistors of control logic can be spared for more processing units.

The trick of executing all branches of conditionals with possibly disabled writes is called **predication**. Suppose that our program has an if statement like

if (condition(i)) { F( ); } else { G( ); }

Depending on the data, on some processors the may be true, while it is false on other processors, thus our vector machine would like to execute function of the first branch in some processors while it should evaluate function of the second branch in other processors. As in SIMD there can be only one control path, the parallel system should execute both paths and disable writes when the processor is not in a valid path. This method converts the original program to the following conditional free algorithm:

enableWrite = condition(i); F( ); enableWrite = !enableWrite; G( );

This version does not have conditional instructions so it can be executed by the SIMD machine. However, the computation time will be the the sum of computation times of the two functions.

This performance bottleneck can be attacked by decomposing the computation into multiple passes and by the exploitation of the early -culling feature. The early -cull compares the value of the fragment with the content of the depth buffer, and if it is smaller than the stored value, the fragment shader is not called for this fragment but the fragment processor is assigned to another data element. The early z-cull is enabled automatically if we execute fragment programs that do not modify the fragment's coordinate (this is the case in all examples discussed so far).

To exploit this feature, we decompose the computation into three passes. In the first pass, only the *condition* is evaluated and the depth buffer is initialized with the values. Recall that if the value is not modified, our full screen quad is on the plane in normalized device space, so it will be on the plane in screen space. Thus, to allow a discrimination according to the condition, we can set values in the range if the condition is true and in if it is false.

The fragment shader of the first pass computes just the condition values and stores them in the depth buffer:

float FragmentShaderCondition( in float2 index : WPOS, // target pixel coordinates uniform samplerRECT Input, // input vector ) : DEPTH // the output goes to the depth buffer { bool condition = ComputeCondition( texRECT(Input, index) ); return (condition) ? 0.8 : 0.2; // 0.8 is greater than 0.5; 0.2 is smaller than 0.5 }

Then we execute two passes for the evaluation of functions and , respectively. In the first pass, the fragment shader computes and the depth comparison is set to pass those fragments where their coordinate is less than the depth value stored in the depth buffer. In this pass, only those fragments are evaluated where the depth buffer has 0.8 value, i.e. where the previous condition was true. Then, in the second pass, the fragment shader is set to compute while the depth buffer is turned to keep those fragments where the fragment's depth is greater than the stored value.

In Subsection 28.7.1 we exploit early -culling to implement a variable length loop in fragment processors.

The vector processing principle assumes that the output is an array where elements are obtained independently. The array should be large enough to keep every shader processor busy. Clearly, if the array has just one or a few elements, then only one or a few shader processors may work at a time, so we loose the advantages of parallel processing.

In many algorithms, the final result is not a large array, but is a single value computed from the array. Thus, the algorithm should reduce the dimension of the output. Doing the reduction in a single step by producing a single texel would not benefit from the parallel architecture. Thus, reduction should also be executed in parallel, in multiple steps. This is possible if the operation needed to compute the result from the array is associative, which is the case for the most common operations, like sum, average, maximum, or minimum.

Suppose that the array is encoded by a two-dimensional texture. At a single phase, we downsample the texture by halving its linear resolution, i.e. replacing four neighboring texels by a single texel. The fragment shaders will compute the operation on four texels. If the original array has resolution, then reduction steps are needed to obtain a single output value. In the following example, we compute the sum of the elements of the input array (Figure 28.5). The CPU program renders a full screen quad in each iteration having divided the render target resolution by two:

Reduction( ) { Set uniform parameter arrayX to identify the input texture for(N /= 2 ; N ≥= 1; N /= 2) { // log_2 N iterations glViewport(0, 0, N, N); // Set render target dimensions to hold NxN elements glBegin(GL_QUADS); // Render a full screen quad glVertex4f(-1,-1, 0, 1); glVertex4f(-1, 1, 0, 1); glVertex4f( 1, 1, 0, 1); glVertex4f( 1,-1, 0, 1); glEnd( ); Copy render target to input texture arrayX } }

The fragment shader computes a single reduced texel from four texels as a summation in each iteration step:

float FragmentShaderSum( ) ( in float2 index : WPOS, // target pixel coordinates uniform samplerRECT arrayX, // input array x ) : COLOR // output is interpreted as a pixel color { float sum = texRECT(arrayX, 2 * index); sum += texRECT(arrayX, 2 * index + float2(1, 0)); sum += texRECT(arrayX, 2 * index + float2(1, 1)); sum += texRECT(arrayX, 2 * index + float2(0, 1)); return sum; }

Note that if we exploited the bi-linear filtering feature of the texture memory, then we could save three texture fetch operations and obtain the average in a single step.

In vector processing a processor is assigned to each output value, i.e. every processor should be aware which output element it is computing and it is not allowed to deroute its result to somewhere else. Such a static assignment is appropriate for **gathering** type computations. The general structure of gathering is that we may rely on a dynamically selected set of input elements but the variable where the output is stored is known a-priory:

index = ComputeIndex( ); // index of the input data y = F(x[index]);

Opposed to gathering, algorithms may have **scattering** characteristics, i.e. a given input value may end up in a variable that is selected dynamically. A simple scatter operation is:

index = ComputeIndex( ); // index of the output data y[index] = F(x);

Vector processing frameworks and our fragment shader implementation are unable to implement scatter since the fragment shader can only write to the pixel it has been assigned to.

If we wish to solve a problem having scattering type algorithm on the GPU, we have two options. First, we can restructure the algorithm to be of gathering type. Converting scattering type parallel algorithms to gathering type ones requires a change of our viewpoint how we look at the problem and its solution. For example, when integral equations or transport problems are considered, this corresponds to the solution of the adjoint problem [ 265 ]. Secondly, we can move the index calculation up to the pipeline and use the rasterizer to establish the dynamic correspondence between the index and the render target (Figure 28.6).

Let us consider a famous scattering type algorithm, **histogram generation**. Suppose we scan an input array of dimension , evaluate function for the elements, and calculate output array of dimension that stores the number of function values that are in bins equally subdividing range .

A scalar implementation of histogram generation would be:

Histogram( x ) { for(int i = 0; i ≤ M; i++) { index = (int)((F(x[i]) - Fmin)/(Fmax - Fmin) * N); // bin index = max(index, 0); index = min(index, N-1); y[index] = y[index] + 1; } }

We can see that the above function writes to the output array at random locations, meaning it cannot be implemented in a fragment shader which is only allowed to write the render target at its dedicated index. The problem of scattering will be solved by computing the index in the vertex shader but delegating the responsibility of incrementing to the rest of the pipeline. The indices are mapped to output pixels by the rasterization hardware. The problem of read-modify-write cycles might be solved by starting a new pass after each increment operation and copying the current render target as an input texture of the next rendering pass. However, this solution would have very poor performance and would not utilize the parallel hardware at all. A much better solution uses the arithmetic capabilities of the merging unit. The fragment shader generates just the increment (i.e. value 1) where the histogram needs to be updated and gives this value to the merging unit. The merging unit, in turn, adds the increment to the content of the render target.

The CPU program generates a point primitive for each input data element. Additionally, it sets the render target to match the output array and also enables the merging unit to execute add operations:

ScanInputVector( ) { Set uniform parameters Fmin, Fmax, N glDisable(GL_DEPTH_TEST); // Turn depth buffering off glBlendFunc(GL_ONE, GL_ONE); // Blending operation: dest = source * 1 + dest * 1; glEnable(GL_BLEND); // Enable blending glViewport(0, 0, N, 1); // Set render target dimensions to hold N elements glBegin(GL_POINTS); // Assign a point primitive to each input elements for(int i = 0; i ≤ M; i++) { glVertex1f( x[i] ); // an input element as a point primitive } glEnd( ); }

The vertex positions in this level are not important since it turns out later where this point will be mapped. So we use the first coordinate of the vertex to pass the current input element .

The vertex shader gets the position of the vertex currently storing the input element, and finds the location of this point in normalized device space. First, function is evaluated and the bin index is obtained, then we convert this index to the range since in normalized device space these will correspond to the extremes of the viewport:

void VertexShaderHistogram( in float inputPos : POSITION, out float4 outputPos : POSITION, uniform float Fmin, uniform float Fmax, uniform float N ) { float xi = inputPos; int index = (int)((F(xi) - Fmin)/(Fmax - Fmin) * N); // bin index = max(index, 0); index = min(index, N-1); float nindex = 2.0 * index / N - 1.0; // normalized device space outputPos = float4(nindex, 0, 0, 1); // set output coordinates }

The above example is not optimized. Note that the index calculation and the normalization could be merged together and we do not even need the size of the output array to execute this operation.

The fragment shader will be invoked for the pixel on which the point primitive is mapped. It simply outputs an increment value of 1:

float FragmentShaderIncr( ) : COLOR // output is interpreted as a pixel color { return 1; // increment that is added to the render target by merging }

**Figure 28.7. Caustics rendering is a practical use of histogram generation. The illumination intensity of the target will be proportional to the number of photons it receives (images courtesy of Dávid Balambér).**

Parallel processors running independently offer a linear speed up over equivalent scalar processor implementations. However, scalar processors may benefit from recognizing similar parts in the computation of different output values, so they can increase their performance utilizing **reuse**. As parallel processors may not reuse data generated by other processors, their comparative advantages become less attractive.

GPUs are parallel systems of significant streaming capabilities, so if data that can be reused are generated early, we can get the advantages of both independent parallel processing and the reuse features of scalar computing.

Our main stream expander is the rasterization. Thus anything happens before rasterization can be considered as a global computation for all those pixels that are filled with the rasterized version of the primitive. Alternatively, the result of a pass can be considered as an input texture in the next pass, so results obtained in the previous pass can be reused by all threads in the next pass.

**Exercises**

28.4-1 Implement a *parallel regula falsi equation solver* for that searches for roots in for many different and parameters. The and parameters are stored in a texture and the pixel shader is responsible for iteratively solving the equation for a particular parameter pair. Terminate the iteration when the error is below a given threshold. Take advantage of the early -culling hardware to prevent further refinement of the terminated iterations. Analyze the performance gain.

28.4-2 Based on the reduction scheme, write a program which applies simple linear *tone mapping* to a high dynamic range image stored in a floating-point texture. The scaling factor should be chosen to map the maximum texel value to the value of one. Find this maximum using iterative reduction of the texture.

28.4-3 Based on the concept of scatter, implement a *caustics renderer* program (Figure 28.7). The scene includes a point light source, a glass sphere, and a diffuse square that is visualized on the screen. Photons with random directions are generated by the CPU and passed to the GPU as point primitives. The vertex shader traces the photon through possible reflections or refractions and decides where the photon will eventually hit the diffuse square. The point primitive is directed to that pixel and the photon powers are added by additive alpha blending.

28.4-4 Based on the concept of scatter, given an array of GSM transmitter tower coordinates, compute cell phone signal strength on a 2D grid. Assume signal strength diminishes linearly with the distance to the nearest transmitter. Use the rasterizer to render circular features onto a 2D render target, and set up blending to pick the maximum.