28.3. GPU as a vector processor

If the computation of the elements is done independently and without sharing temporary results, the parallel machines are called vector processors or array processors. As in the GPU hardware the fragment shader is associated with the elements of the output data, we use the fragment shader to evaluate output elements. Of course, the evaluation in a given processor must also be aware of which element is being computed, which is the fundamental source of data dependency (it would not make sense to compute the very same data many times on a parallel machine). In the fragment shader, the index of the data element is in fact the pair of pixel coordinates. This is available in screen space as a pair of two integers specifying the row and the column where the pixel is located.

Figure 28.4.  GPU as a vector processor.

In the simplest, but practically the most important case, we wish to have a result in all pixels in a single rendering pass. So we have to select a geometric primitive that is mapped to all pixels in screen space, and to which a single pixel is mapped only once. Such a geometric primitive is the virtual display itself, thus we should render a rectangle or a quadrilateral that represents the window of our virtual camera. In screen space, this is the viewport rectangle; in clipping space, this is a square on the x, y plane having corners in homogeneous coordinates [-1,-1, 0, 1], [-1, 1, 0, 1], [1, 1, 0, 1], [1,-1, 0, 1]. This rectangle is also called the full screen quad and is processed by the hardware as two triangles (Figure 28.4).

Suppose that we want to compute an output array y of dimension N from an input array x of possibly different dimension M and a global parameter p with function F:

   y_i = F(i, x, p),   i = 1, ..., N.

To set up the GPU for this computation, we assign output array y to the output texture that is the current render target. Texture size is chosen according to the output size, and the viewport is set to cover the entire render target. A two-dimensional array of horizontal resolution H and vertical resolution V is capable of storing H·V elements. If H·V ≥ N, then it is up to us how the horizontal and vertical resolutions are chosen. However, GPUs may impose restrictions, e.g. the resolutions cannot exceed a hardware-dependent maximum or, if we wish to use the result as an input texture in the next pass or to compute binary reductions, the resolutions are preferred to be powers of two. If power of two dimensions are advantageous but the dimension of the array is different, we can extend the array by additional void elements.
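As a minimal sketch of this packing (our own illustration, not from the original text; the function name and the zero padding policy are assumptions), the following C fragment copies an N-element array into the first N texels of an H × V row-major texel buffer and fills the remaining void elements with zeros:

       #include <string.h>

       // Pack a 1D array of N floats into an H x V row-major texel buffer.
       // Requires H * V >= N; element i lands in row i / H, column i % H.
       void PackArrayToTexture(const float *x, int N, float *texels, int H, int V) {
          memset(texels, 0, H * V * sizeof(float)); // void (padding) elements
          memcpy(texels, x, N * sizeof(float));     // row-major packing
       }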

According to vector processing principles, different output values are computed independently without sharing temporary results. As in the GPU hardware the fragment shader is associated with the elements of the output data and can run independently of other elements, we use the fragment shader to evaluate function F. To find its parameters, we need to know i, i.e. which element is currently computed, and we should have access to input array x. The simplest way is to store the input array as an input texture (or multiple input textures if that is more convenient) since the fragment shader can access textures.
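If the output is a one-dimensional array packed row by row, the element index i can be recovered in the fragment shader from the pixel coordinates. The following Cg helper is a hedged sketch of ours (it assumes the row-major packing shown above and that the horizontal resolution H is passed in as a uniform):

       float Index1D(float2 index, float H) { // index: WPOS pixel coordinates
          return floor(index.y) * H + floor(index.x); // i = row * H + column
       }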

The only responsibility of the CPU is to set the uniform parameters, specify the viewport and send a full screen quad down the pipeline. Uniform parameters select the input texture and define global parameter p. Assuming the OpenGL API, the corresponding CPU program in C would look like the following:

       void StartVectorOperation( ) {
          // Set uniform parameters p and arrayX identifying the input texture

          glViewport(0, 0, H, V);       // Set horizontal and vertical resolutions, H and V
          glBegin(GL_QUADS);            // The next four vertices define a quad
             glVertex4f(-1,-1, 0, 1);   // Vertices assuming normalized device space
             glVertex4f(-1, 1, 0, 1);
             glVertex4f( 1, 1, 0, 1);
             glVertex4f( 1,-1, 0, 1);
          glEnd( );
       }

Note that this program defines the rectangle directly in normalized device space using homogeneous coordinates passed as input parameters of the glVertex4f functions. So in the pipeline we should make sure that the vertices are not transformed.

For the shader program, the varying inputs are available in dedicated registers and outputs must also be written to dedicated registers. All of these registers are of type float4, that is, they can hold 4 float values. The role of the register is explained by its name. For example, the current value of the vertex position can be fetched from the POSITION register. Similar registers can store the texture coordinates or the color associated with this vertex.

The vertex shader gets the position of the vertex and is responsible for transforming it to the normalized device space. As we directly defined the vertices in normalized device space, the vertex shader simply copies the content of its input POSITION register to its output POSITION register (the input and output classification is given by the in and out keywords in front of the variable names assigned to registers):

       void VertexShader( in  float4 inputPos  : POSITION,
                          out float4 outputPos : POSITION )
       {
          outputPos = inputPos;
       }

The geometry shader should keep the rectangle as it is without changing the vertex coordinates. As this is the default operation for the geometry shader, we do not specify any program for it. The rectangle leaving the geometry shader goes to the clipping stage, which keeps it since we defined our rectangle to be inside the clipping region. Then, Cartesian coordinates are obtained from the homogeneous ones by dividing the first three coordinates by the fourth one. As we set all fourth homogeneous coordinates to 1, the first three coordinates are not altered. After homogeneous division, the fixed-function stage transforms the vertices of the rectangle to the vertices of a screen space rectangle having the x, y coordinates equal to the corners of the viewport and the z coordinate to 0.5. Finally, this rectangle is rasterized in screen space, so all pixels of the viewport are identified as a target, and the fragment shader is invoked for each of them.
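For instance, with a viewport of horizontal resolution H and vertical resolution V anchored at the origin, the viewport transform maps normalized device coordinates (x, y) to screen space as X = (x + 1) · H/2 and Y = (y + 1) · V/2 (a standard OpenGL convention; exact pixel-center offsets depend on the implementation), so the corners (-1,-1) and (1,1) of the full screen quad indeed map to the opposite corners of the viewport.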

The fragment shader is our real computing unit. It gets the input array x and global parameter p as uniform parameters and can also find out which pixel is being computed by reading the WPOS register:

       float FragmentShaderF(
          in float2 index : WPOS,      // target pixel coordinates
          uniform samplerRECT arrayX,  // input array
          uniform float p              // global parameter p
          ) : COLOR                    // output is interpreted as a pixel color
       {
          float yi = F(index, arrayX, p); // F is the function to be evaluated
          return yi;
       }

In this program two input parameters were declared as uniform inputs by the uniform keyword, a float parameter p and the texture identification arrayX. The type of the texture is samplerRECT, which determines the addressing mode, i.e. how a texel can be selected. In this addressing mode, texel centers are on a two-dimensional integer grid. Note that here we used a different syntax to express what the output of the shader is. Instead of declaring a register as out, the output is given as the return value of the function, which is assigned to the output COLOR register.

28.3.1. Implementing the SAXPY BLAS function

To show concrete examples, we first implement the level 1 functionality of the Basic Linear Algebra Subprograms (BLAS) library (http://www.netlib.org/blas/) that evaluates vector functions of the following general form:

   y = p · x + y,

where x and y are vectors and p is a scalar parameter. This operation is called SAXPY in the BLAS library. Now our fragment shader inputs two textures, vector x and the original version of vector y. One fragment shader processor computes a single element of the output vector:

       float FragmentShaderSAXPY(
          in float2 index : WPOS,      // target pixel coordinates
          uniform samplerRECT arrayX,  // input array x
          uniform samplerRECT arrayY,  // original version of y
          uniform float p              // global parameter p
          ) : COLOR                    // output is interpreted as a pixel color
       {
          float yoldi = texRECT(arrayY, index); // yoldi = arrayY[index]
          float xi = texRECT(arrayX, index);    // xi = arrayX[index]
          float yi = p * xi + yoldi;
          return yi;
       }

Note that instead of indexing an array as in CPU style programming, here we fetch the element from a texture that represents the array by the texRECT Cg function. The first parameter of the texRECT function is the identification number of a two-dimensional texture, which is passed from the CPU as a uniform parameter, and the second is the texture address pointing to the texel to be selected.

Here we can observe how to handle the limitation that a shader can only read textures but is not allowed to write into them. In the SAXPY operation, vector y is an input and simultaneously also the output of the operation. To resolve this, we assign two textures to vector y. One is the original vector in texture arrayY, and the other one is the render target. While we read the original value, the new version is produced without reading back from the render target, which would not be possible.
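When the result of one pass is needed as input in the next pass, this scheme is commonly extended to "ping-ponging" the two textures. The following C sketch is our own illustration, not from the original text; it assumes the EXT_framebuffer_object extension and that fbo and the two textures texY[0], texY[1] have been created and initialized elsewhere:

       GLuint fbo, texY[2]; // framebuffer object and the two copies of vector y
       int curr = 0;        // index of the texture holding the current y

       void SaxpyPass( ) {
          glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
          // attach the other copy of y as the render target of this pass
          glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                                    GL_TEXTURE_RECTANGLE_ARB, texY[1 - curr], 0);
          // bind texY[curr] as the arrayY input sampler, then render the quad
          curr = 1 - curr; // the new y becomes the input of the next pass
       }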

28.3.2. Image filtering

Another important example is the discrete convolution of two textures, an image and a filter kernel, which is the basic operation in many image processing algorithms:

   L̃(X, Y) = Σ_{i=-M}^{M} Σ_{j=-M}^{M} L(X + i, Y + j) · w(i, j),

where L̃(X, Y) is the filtered value at pixel X, Y; L is the original image, and w is the filter kernel, which spans over (2M+1) × (2M+1) pixels.

Now the fragment shader is responsible for the evaluation of a single output pixel according to the input image given as texture Image and the filter kernel stored in a smaller texture Weight. The half size M of the filter kernel is passed as a uniform variable:

       float3 FragmentShaderConvolution(
          in float2 index : WPOS,     // target pixel coordinates
          uniform samplerRECT Image,  // input image
          uniform samplerRECT Weight, // filter kernel
          uniform float M             // half size of the filter kernel
          ) : COLOR                   // a pixel of the filtered image
       {
          float3 filtered = float3(0, 0, 0);
          for(int i = -M; i <= M; i++)
             for(int j = -M; j <= M; j++) {
                float2 kernelIndex = float2(i, j);
                float2 sourceIndex = index + kernelIndex;
                filtered += texRECT(Image, sourceIndex) * texRECT(Weight, kernelIndex);
             }
          return filtered;
       }

Note that this example was a linear filter, i.e. a convolution, but non-linear filters (e.g. median filtering) could be implemented similarly. In this program we applied arithmetic operators (*, +=, =) to float2 and float3 type variables storing two and three floats, respectively. The Cg compiler and the GPU will execute these instructions independently on the float elements.
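For example, a non-linear 3 × 3 maximum filter (gray-scale dilation) can be obtained from the same gathering structure by replacing the weighted sum with a maximum selection. This is only an illustrative sketch of ours, not part of the original set of examples:

       float3 FragmentShaderDilation(
          in float2 index : WPOS,    // target pixel coordinates
          uniform samplerRECT Image  // input image
          ) : COLOR                  // a pixel of the dilated image
       {
          float3 filtered = texRECT(Image, index);
          for(int i = -1; i <= 1; i++)     // scan the 3 x 3 neighborhood
             for(int j = -1; j <= 1; j++)
                filtered = max(filtered, texRECT(Image, index + float2(i, j)));
          return filtered;
       }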

Note also that we did not care what happens at the edges of the image, the texture is always fetched with the sum of the target address and the shift of the filter kernel. A CPU implementation ignoring image boundaries would obviously be wrong, since we would over-index the source array. However, the texture fetching hardware implementing, for example, the texRECT function automatically solves this problem. When the texture is initialized, we can specify what should happen if the texture coordinate is out of its domain. Depending on the selected option, we get the closest texel back, or a default value, or the address is interpreted in a periodic way.
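As a hedged illustration (not from the original text; texId stands for a texture object created elsewhere), the out-of-domain behavior of an ordinary two-dimensional OpenGL texture can be selected with standard texture parameters. Note that rectangle textures typically restrict the choice to the clamping modes:

       glBindTexture(GL_TEXTURE_2D, texId);
       // closest texel back: clamp the coordinate to the edge
       glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
       glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
       // or: a default value, taken from the border color
       // glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_BORDER);
       // float border[4] = {0, 0, 0, 0};
       // glTexParameterfv(GL_TEXTURE_2D, GL_TEXTURE_BORDER_COLOR, border);
       // or: periodic interpretation of the address
       // glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);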

Exercises

28.3-1 Following the vector processing concept, write a pixel shader which, when a full screen quad is rendered, quantizes the colors of an input texture to a few levels in all three channels, achieving a cell shading effect.

28.3-2 Following the gathering data processing scheme, write a pixel shader which, when a full screen quad is rendered, performs median filtering on an input grayscale image, achieving dot noise reduction. The shader should fetch nine texel values from a 3 × 3 neighborhood, outputting the fifth largest.

28.3-3 Implement an anisotropic, edge preserving low-pass image filter with the gathering data processing scheme. In order to preserve edges, compute the Euclidean distance of the original pixel color and the color of a neighboring pixel, and include the neighbor in the averaging operation only when the distance is below a threshold.

28.3-4 Write a parallel Mandelbrot set rendering program by assuming that pixel x, y corresponds to complex number c = x + iy and deciding whether or not the z_{n+1} = z_n^2 + c iteration diverges when started from z_0 = 0. The divergence may be checked by iterating a sufficiently large fixed number of times and examining whether |z_n| is large enough. Divergent points are depicted with white, non-divergent points with black.