Problems emerging in physics or engineering are usually described mathematically as a set of partial differential or integral equations. As physical systems expand in space and time, derivatives or integrals should be evaluated both in temporal and spatial domains.
Figure 28.8. Finite element representations of functions. The texture filtering of the GPU directly supports finite element representations using regularly placed samples in one, two, and threedimensions and interpolating with piecewise constant and piecewise linear basis functions.
When we have to represent a value over space and time, we should use functions having the spatial position and the time as their variables. The representation of general functions would require infinite amount of data, so in numerical methods we only approximate them with finite number of values. Intuitively, these values can be imagined as the function values at discrete points and time instances. The theory behind this is the finite element method. If we need to represent function with finite data, we approximate the function in the following finite series form (Figure 28.8):
where are predefined basis functions and are the coefficients that describe .
A particularly simple finite element representation is the piecewise linear scheme that finds possibly regularly placed sample points in the domain, evaluates the function at these points to obtain the coefficients and linearly interpolates between and .
When the system is dynamic, solution will be time dependent, so a new finite element representation is needed for every time instance. We have basically two options for this. We can set sample points in a static way and allow only coefficients to change in time. This approach is called Eulerian. On the other hand, we can also allow the sample points to move with the evaluation of the system, making also sample points time dependent. This is the Lagrangian approach, where sample locations are also called particles.
Intuitive examples of Eulerian and Lagrangian discretization schemes are how temperature and other attributes are measured in meteorology. In ground stations, these data are measured at fixed locations. However, meteorological balloons can also provide the same data, but from varying positions that follow the flow of the air.
In this section we discuss a case study for GPUbased scientific computation. The selected problem is computational fluid dynamics. Many phenomena that can be seen in nature like smoke, cloud formation, fire, and explosion show fluidlike behavior. Understandably, there is a need for good and fast fluid solvers both in engineering and in computer animation.
The mathematical model of the fluid motion is given by the NavierStokes equation. First we introduce this partial differential equation, then discuss how GPUbased Eulerian and Langrangian solvers can be developed for it.
A fluid with constant density and temperature can be described by its velocity and pressure fields. The velocity and the pressure vary both in space and time:
Let us focus on a fluid element of unit volume that is at point at time . At an earlier time instance , this fluid element was in and, according to the fundamental law of dynamics, its velocity changed according to an acceleration that is equal to total force divided by mass of this unit volume fluid element:
Mass of a unit volume fluid element is called the fluid density. Moving the velocity terms to the left side and dividing the equation by , we can express the substantial derivative of the velocity:
The total force can stem from different sources. It may be due to the pressure differences:
where is the gradient of the pressure field. The minus sign indicates that the pressure accelerates the fluid element towards the low pressure regions. Here we used the nabla operator, which has the following form in a Cartesian coordinate system:
Due to friction, the fluid motion is damped. This damping depends on the viscosity of the fluid. Highly viscous fluids like syrup stick together, while lowviscosity fluids flow freely. The total damping force is expressed as a diffusion term since the viscosity force is proportional to the Laplacian of the velocity field:
Finally, an external force field may also act on our fluid element causing acceleration. In the gravity field of the Earth, assuming that the vertical direction is axis , this external acceleration is where [ ].
Adding the forces together, we can obtain the NavierStokes equation for the velocity of our fluid element:
In fact, this equation is the adaptation of the fundamental law of dynamics for fluids. If there is no external force field, the momentum of the dynamic system must be preserved. This is why this equation is also called momentum conservation equation.
Closed physics systems preserve not only the momentum but also the mass, so this aspect should also be built into our fluid model. Simply put, the mass conservation means that what flows into a volume must also flow out, so the divergence of the mass flow is zero. If the fluid is incompressible, then the fluid density is constant, thus the mass flow is proportional to the velocity field. For incompressible fluids, the mass conservation means that the velocity field is divergence free:
The Eulerian approach tracks the evolution of the velocity and pressure fields on fixed, uniform grid points. The grid allows a simple approximation of spatial derivatives by finite differences. If the grid points are in distances , , and along the three coordinate axes and the values of scalar field and vector field at grid point are and , respectively, then the gradient, the divergence and the Laplacian operators can be approximated as:
The NavierStokes equation and the requirement that the velocity is divergence free define four scalar equations (the conservation of momentum is a vector equation) with four scalar unknowns (, , , ). The numerical solver computes the current fields advancing the time in discrete steps of length :
The velocity field is updated in several steps, each considering a single term on the right side of this equation. Let us consider these steps onebyone.
To initialize the new velocity field at point , we fetch the previous field at position since the fluid element arriving at point was there in the previous time step [ 261 ]. This step computes advection, i.e. the phenomenon that the fluid carries its own velocity field:
To damp the velocity field, we could update it proportionally to a diffusion term:
However, this type of forward Euler integrator is numerically unstable. The reason of instability is that forward methods predict the future based on the present values, and as time passes, each simulation step adds some error, which may accumulate and exceed any limit.
Unlike forward integrators, a backward method can guarantee stability. A backward looking approach is stable since while predicting the future, it simultaneously corrects the past. Thus, the total error converges to a finite value and remains bounded. Here a backward method means that the Laplacian is obtained from the future, yet unknown velocity field, and not from the current velocity field:
At this step of the computation, the advected field is available at the grid points, the unknowns are the diffused velocity for each of the grid points. Using (28.5) to compute the Laplacian of the coordinates of unknown vector field at grid point , we observe that it will be a linear function of the velocities in the grid point and its neighbors. Thus, (28.6) is a sparse linear system of equations:
where vector is the vector of the known velocities obtained by advection, is the vector of unknown velocities of the grid points, and matrixvector multiplication represents the discrete form of .
Such systems are primary candidates for Jacobi iteration (see Chapter 12 of this book, titled Scientific Computation). Initially we fill vector with zero and evaluate the right side of (28.7) iteratively, moving the result of the previous step to vector of the right side. Thus, we traced back the problem to a sequence of sparse vectormatrix multiplications. Note that matrix needs not be stored. When velocity field is needed at a grid point, the neighbors are looked up and the simple formula of (28.5) gives us the result.
Updating a value in a grid point according to its previous value and the values of its neighbors are called image filtering. Thus, a single step of the Jacobi iteration is equivalent to an image filtering operation, which is discussed in Section 28.3.2.
So far, we calculated an updated velocity field without considering the unknown pressure field. In the projection step, we compute the unknown pressure field and update the velocity field with it:
The pressure field is obtained from the requirement that the final velocity field must be divergence free. Let us apply the divergence operator to both sides of this equation. After this, the left side becomes zero since we aim at a divergence free vector field for which :
Assuming a regular grid where vector field is available, searching the unknown pressure at grid positions, and evaluating the divergence and the Laplacian with finite differences of equations (28.4) and (28.5), respectively, we again end up with a sparse linear system for the discrete pressure values and consequently for the difference between the final velocity field and . This system is also solved with Jacobi iteration. Similarly to the diffusion step, the Jacobi iteration of the projection is also a simple image filtering operation.
The discretized velocity and pressure fields can be conveniently stored in threedimensional textures, where discrete variables are defined at the centers of elemental cubes, called voxels of a grid [ 106 ]. At each time step, the content of these data sets should be refreshed (Figure 28.9).
The representation of the fields in textures has an important advantage when the advection is computed. The advected field at voxel center is obtained by copying the field value at position . Note that the computed position is not necessarily a voxel center, but it can be between the grid points. According to the finite element concept, this value can be generated from the finite element representation of the data. If we assume piecewise linear basis functions, then the texture filtering hardware automatically solves this problem for us at no additional computation cost.
Figure 28.10. Computation of the simulation steps by updating threedimensional textures. Advection utilizes the texture filtering hardware. The linear equations of the viscosity damping and projection are solved by Jacobi iteration, where a texel (i.e. voxel) is updated with the weighted sum of its neighbors, making a single Jacobi iteration step equivalent to an image filtering operation.
The disadvantage of storing vector and scalar fields in threedimensional textures is that the GPU can only read these textures no matter whether we take the graphics API or the GPGPU approach. The updated field must be written to the render target in case of the graphics API approach, and to the global memory if we use a GPGPU interface. Then, for the next simulation step, the last render target or global memory should be declared as an input texture.
In order to avoid write collisions, we follow a gathering approach and assign threads to each of the grid points storing output values. If GPUs fetch global data via textures, then the new value written by a thread becomes visible when the pass or the thread run is over, and the output is declared as an input texture for the next run. Thus, the computation of the time step should be decomposed to elemental update steps when the new output value of another grid point is needed. It means that we have and advection pass, a sequence of Jacobi iteration passes of the diffusion step, an external force calculation pass, and another sequence of Jacobi iteration passes of the projection step. With a GPGPU framework, a thread may directly read the data produced by another thread, but then synchronization is needed to make sure that the read value is already valid, so not the old but the new value is fetched. In such cases, synchronization points have the same role and passes or decomposed kernels.
In case of graphics APIs, there is one additional limitation. The render target can only be twodimensional, thus either we flatten the layers of the threedimensional voxel array into a large twodimensional texture, or update just a single layer at a time. Flattened threedimensional textures are shown by Figure 28.11. Once the textures are set up, one simulation step of the volume can be done by the rendering of a quad covering the flattened grid.
The graphics API approach has not only drawbacks but also an advantage over the GPGPU method, when the linear systems are solved with Jacobi iteration. The graphics API method runs the fragment shader for each grid point to update the solution in the texel associated with the grid point. However, if the neighbor elements of a particular grid point are negligible, we need less iteration steps than in a grid point where the neighbor elements are significant. In a quasiSIMD machine like the GPU, iterating less in some of the processors is usually a bad idea. However, the exploitation of the early culling hardware helps to sidestep this problem and boosts the performance [ 267 ]. The coordinate in the depth value is set proportionally to the maximum element in the neighborhood and to the iteration count. This way, as the iteration proceeds, the GPU processes less and less number of fragments, and can concentrate on important regions. According to our measurements, this optimization reduces the total simulation time by about 40%.
When we wish to visualize the flow, we can also assume that the flow carries a scalar display variable with itself. The display variable is analogous with some paint or confetti poured into the flow. The display variable is stored in a float voxel array.
Using the advection formula for display variable , its field can also be updated in parallel with the simulation of time step :
At a time, the color and opacity of a point can be obtained from the display variable using a user controlled transfer function.
We can use a 3D texture slicing rendering method to display the resulting display variable field, which means that we place semitransparent polygons perpendicular to the view plane and blend them together in back to front order (Figure 28.12). The color and the opacity of the 3D texture is the function of the 3D display variable field.
In the Lagrangian approach, the space is discretized by identifying particles, i.e. following just finite number of fluid elements. Let us denote the position and the velocity of the th discrete fluid element by and , respectively. We assume that all particles represent fluid elements of the same mass , but as the density varies in space and will be the attribute of the particle, every particle is associated with a different volume of the fluid. The momentum conservation equation has the following form in this case:
If particles do not get lost during the simulation, the mass is automatically conserved. However, temporarily this mass may concentrate in smaller parts of the volume, so the simulated fluid is not incompressible. In Lagrangian simulation, we usually assume compressible gas.
From the knowledge of the system at discrete points, attributes are obtained at an arbitrary point via interpolation. Suppose we know an attribute at the particle locations, i.e. we have . Attribute is interpolated at location by a weighted sum of contributions from the particles:
where is the volume represented by the particle in point , and is a smoothing kernel, also called radial basis function, that depends on distance between the particle location and the point of interest. From a different point of view, the smoothing kernel expresses how quickly the impact of a particle diminishes farther away. The smoothing kernel is normalized if smoothing preserves the total amount of the attribute value, which is the case if the kernel has unit integral over the whole volumetric domain. An example for the possible kernels is the spiky kernel of maximum radius :
For normalized kernels, the particle density at point is approximated as:
As each particle has the same mass , the volume represented by particle is
According to the ideal gas law, the pressure is inversely proportional to the volume on constant temperature, thus at particle the pressure is
where constant depends on the temperature.
The pressure at an arbitrary point is
The acceleration due to pressure differences requires the computation of the gradient of the pressure field. As spatial variable shows up only in the smoothing kernel, the gradient can be computed by using the gradient of the smoothing kernel:
Thus, our first guess for the pressure force at particle is:
However, there is a problem here. Our approximation scheme could not guarantee to satisfy the physical rules including symmetry of forces and consequently the conservation of momentum. We should make sure that the force on particle due to particle is always equal to the force on particle due to particle . The symmetric relation can be ensured by modifying the pressure force in the following way:
The viscosity term contains the Laplacian of the vector field, which can be computed by using the Laplacian of the smoothing kernel:
Similarly to the pressure force, a symmetrized version is used instead that makes the forces symmetric:
External forces can be directly applied to particles. Particleobject collisions are solved by reflecting the velocity component that is perpendicular to the surface.
Having computed all forces, and approximating the time derivatives of (28.8) by finite differences, we may obtain the positions and velocities of each of the particles in the following way:
Note that this is also a forward Euler integration scheme, which has stability problems. Instead of this, we should use a stable version, for example, the Verlet integration [ 67 ].
The Lagrangian approach tracks a finite number of particles where the forces acting on them depend on the locations and actual properties of other particles. Thus, to update a system of particles, interactions should be examined. Such tasks are generally referred to as the Nbody problem.
In a GPGPU framework, the particle attributes can be stored in the global memory as a onedimensional array or can be fetched via onedimensional textures. In graphics API frameworks, particle attributes can only be represented by textures. The advantage of reading the data via textures is only the better caching since now we cannot utilize the texture filtering hardware. A gathering type method would assign a thread to each of the controlled particles, and a thread would compute the effect of other particles on its own particle. As the smoothing kernel has finite support, only those particles can interact with the considered one, which are not farther than the maximum radius of the smoothing filter. It is worth identifying these particles only once, storing them in a twodimensional texture of in the global memory, and using this information in all subsequent kernels.
Figure 28.13. Data structures stored in arrays or textures. Onedimensional float3 arrays store the particles' position and velocity. A onedimensional float2 texture stores the computed density and pressure. Finally, a twodimensional texture identifies nearby particles for each particle.
A GPGPU approach would need three onedimensional arrays representing the particle position, velocity, density and pressure, and a twodimensional array for the neighboring particles (Figure 28.13). In a graphics API approach, these are one or twodimensional textures. We can run a kernel or a fragment shader for each of the particles. In a GPGPU solution it poses no problem for the kernel to output a complete column of the neighborhood array, but in the fragment shaders of older GPUs the maximum size of a single fragment is limited. To solve this, we may limit the number of considered neighbor particles to the number that can be outputted with the available multiple render target option.
Figure 28.14. A time step of the Lagrangian solver. The considered particle is the red one, and its neighbors are yellow.
The processing of a single particle should be decomposed to passes or kernel runs when we would like to use the already updated properties of other particles (Figure 28.14). The first pass is the identification of the neighbors for each particles, i.e. those other particles that are closer than the support of the smoothing kernel. The output of this step is a twodimensional array where columns are selected by the index of the considered particle and the elements in this column store the index and the distance of those particles that are close by.
The second pass calculates the density and the pressure from the number and the distance of the nearby particles. Having finished this pass, the pressure of every particle will be available for all threads. The third pass computes the forces from the pressure and the velocity of nearby particles. Finally, each particle gets its updated velocity and is moved to its new position.
Figure 28.15. Animations obtained with a Lagrangian solver rendering particles with spheres (upper image) and generating the isosurface (lower image) [ 114 ].
Having obtained the particle positions, the system can be visualized by different methods. For example, we can render a point or a small sphere for each particle (upper image of Figure 28.15). Alternatively, we can splat particles onto the screen, resulting in a rendering style similar to that of the Eulerian solver (Figure 28.12). Finally, we can also find the surface of the fluid and compute reflections and refractions here using the laws of geometric optics (lower image of Figure 28.15). The surface of fluid is the isosurface of the density field, which is the solution of the following implicit equation:
This equation can be solved for points visible in the virtual camera by ray marching. We trace a ray from the eye position through the pixel and make small steps on it. At every sample position we check whether the interpolated density has exceeded the specified isovalue . The first step when this happens is the intersection of the ray and the isosurface. The rays are continued from here into the reflection and refraction directions. The computation of these directions also requires the normal vector of the isosurface, which can be calculated as the gradient of the density field.
Exercises
28.71 Implement a gameoflife in CUDA. On a twodimensional grid of cells, every cell is either populated of unpopulated. In every step, all cell states are reevaluated. For populated cells:
Each cell with one or no neighbors dies, as if by loneliness.
Each cell with four or more neighbors dies, as if by overpopulation.
Each cell with two or three neighbors survives.
For unpopulated cells:
Each cell with three neighbors becomes populated.
Store cell states in arrays accessible as textures. Always compute the next iteration state into a different output array. Start with a random grid and display results using the graphics API.
28.72 Implement a wave equation solver. The wave equation is a partial differential equation:
where is the wave height above point in time , and is the speed of the wave.
CHAPTER NOTES

The fixed transformation and multitexturing hardware of GPUs became programmable vertex and fragment shaders about a decade ago. The high floating point processing performance of GPUs has quickly created the need to use them not only for incremental rendering but for other algorithms as well. The first GPGPU algorithms were also graphics related, e.g. ray tracing or the simulation of natural phenomena. An excellent review about the early years of GPGPU computing can be found in [ 200 ]. Computer graphics researchers have been very enthusiastic to work with the new hardware since its general purpose features allowed them to implement algorithms that are conceptually different from the incremental rendering, including the physically plausible light transport, called global illumination [ 264 ], physics simulation of rigid body motion with accurate collision detection, fluid dynamics etc., which made realistic simulation and rendering possible in realtime systems and games. The GPU Gems book series [ 75 ], [ 194 ], [ 205 ] and the ShaderX (currently GPU Pro [ 70 ]) series provide a huge collection of such methods.
Since the emergence of GPGPU platforms like CUDA and OpenCL, GPU solutions have showed up in all fields of high performance computing. Online warehouses of papers and programs are the gpgpu.org homepage and the NVIDIA homepage [ 195 ], [ 196 ], which demonstrate the wide acceptance of this approach in many fields. Without aiming at completeness, successful GPU applications have targeted high performance computing tasks including simulation of all kinds of physics phenomena, differential equations, tomographic reconstruction, computer vision, database searches and compression, linear algebra, signal processing, molecular dynamics and docking, financial informatics, virus detection, finite element methods, Monte Carlo methods, simulation of computing machines (CNN, neural networks, quantum computers), pattern matching, DNA sequence alignment, cryptography, digital holography, quantum chemistry, etc.
To get a scalable system that is not limited by the memory of a single GPU card, we can build GPU clusters. A single PC can be equipped with four GPUs and the number of interconnected PCs is unlimited [ 293 ]. However, in such systems the communication will be the bottleneck since current communication channels cannot compete with the computing power of GPUs.
The European Union and the European Social Fund under the grant agreement no. TÁMOP 4.2.1/B09/1/KMR20100003.