The general form of linear algebraic systems with $n$ unknowns and $m$ equations is given by

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1\\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2\\
&\;\;\vdots\\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m.
\end{aligned}$$

This system can be written in the more compact form

$$Ax = b,$$

where

$$A = [a_{ij}] \in \mathbb{R}^{m\times n}, \qquad x = [x_1, \ldots, x_n]^T \in \mathbb{R}^n, \qquad b = [b_1, \ldots, b_m]^T \in \mathbb{R}^m.$$

The system is called *underdetermined* if $m < n$. For $m > n$, the system is called *overdetermined*. Here we investigate only the case $m = n$, when the coefficient matrix $A$ is square. We also assume that the inverse matrix $A^{-1}$ exists (or equivalently $\det(A) \neq 0$). Under this assumption the linear system $Ax = b$ has exactly one solution: $x = A^{-1}b$.

**Definition 12.4 **
*The matrix $A \in \mathbb{R}^{n\times n}$ is upper triangular if $a_{ij} = 0$ for all $i > j$. The matrix $A$ is lower triangular if $a_{ij} = 0$ for all $i < j$.*

For example, the general form of the upper triangular matrices is the following:

$$U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & u_{nn} \end{bmatrix}.$$

We note that the diagonal matrices are both lower and upper triangular. It is easy to show that $\det(A) = a_{11}a_{22}\cdots a_{nn}$ holds for the upper or lower triangular matrices. It is easy to solve linear systems with triangular coefficient matrices. Consider the following upper triangular linear system:

$$\begin{aligned}
u_{11}x_1 + u_{12}x_2 + \cdots + u_{1n}x_n &= b_1\\
u_{22}x_2 + \cdots + u_{2n}x_n &= b_2\\
&\;\;\vdots\\
u_{nn}x_n &= b_n.
\end{aligned}$$

This can be solved by the so-called *back substitution* algorithm.

*Back-Substitution($U, b, n$)*

 1  $x_n \leftarrow b_n/u_{nn}$
 2  `FOR` $i \leftarrow n-1$ `DOWNTO` $1$
 3    `DO` $x_i \leftarrow \big(b_i - \sum_{j=i+1}^{n} u_{ij}x_j\big)/u_{ii}$
 4  `RETURN` $x$

The solution of lower triangular systems is similar.
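The back substitution loop above can be sketched in Python (a minimal illustration with 0-based indexing; `U` and `b` are assumed to be plain nested lists, and every diagonal entry of `U` must be nonzero):

```python
def back_substitution(U, b):
    """Solve U x = b for an upper triangular matrix U (list of lists)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):          # last unknown first
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / U[i][i]
    return x
```

Solving a lower triangular system works the same way, only the loop runs forward instead of backward.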

**The Gauss method.** The Gauss method or Gaussian elimination (GE) consists of two phases:

I. The linear system is transformed to an equivalent upper triangular system using elementary operations (see Figure 12.2).

II. The obtained upper triangular system is then solved by the back substitution algorithm.

The first phase is often called the elimination or forward phase. The second phase of GE is called the backward phase. The elementary operations are of the following three types:

1. Add a multiple of one equation to another equation.

2. Interchange two equations.

3. Multiply an equation by a nonzero constant.

The elimination phase of GE is based on the following observation. Multiply equation $k$ by $\gamma$ and subtract it from equation $i$:

$$(a_{i1} - \gamma a_{k1})x_1 + \cdots + (a_{in} - \gamma a_{kn})x_n = b_i - \gamma b_k.$$

If $a_{kk} \neq 0$, then by choosing $\gamma = a_{ik}/a_{kk}$, the coefficient of $x_k$ becomes $0$ in the new equivalent equation, which replaces equation $i$. Thus we can eliminate variable $x_k$ (or coefficient $a_{ik}$) from equation $i$.

The Gauss method eliminates the coefficients (variables) under the main diagonal of $A$ in a systematic way. First variable $x_1$ is eliminated from equations $i = 2, \ldots, n$ using equation $1$, then $x_2$ is eliminated from equations $i = 3, \ldots, n$ using equation $2$, and so on.

Assume that the unknowns are eliminated in the first $k-1$ columns under the main diagonal and the resulting linear system has the form

$$\begin{aligned}
a_{11}^{(1)}x_1 + \cdots + a_{1k}^{(1)}x_k + \cdots + a_{1n}^{(1)}x_n &= b_1^{(1)}\\
&\;\;\vdots\\
a_{kk}^{(k)}x_k + \cdots + a_{kn}^{(k)}x_n &= b_k^{(k)}\\
&\;\;\vdots\\
a_{nk}^{(k)}x_k + \cdots + a_{nn}^{(k)}x_n &= b_n^{(k)}.
\end{aligned}$$

If $a_{kk}^{(k)} \neq 0$, then multiplying row $k$ by $\gamma = a_{ik}^{(k)}/a_{kk}^{(k)}$ and subtracting it from equation $i$ we obtain

$$a_{ij}^{(k+1)} = a_{ij}^{(k)} - \gamma a_{kj}^{(k)}, \qquad b_i^{(k+1)} = b_i^{(k)} - \gamma b_k^{(k)}.$$

Since $a_{ik}^{(k+1)} = 0$ for this choice of $\gamma$, we eliminated the coefficient $a_{ik}$ (variable $x_k$) from equation $i$. Repeating this process for $i = k+1, \ldots, n$ we can eliminate the coefficients under the main diagonal entry $a_{kk}^{(k)}$. Next we denote by $A[i,j]$ the element $a_{ij}$ of matrix $A$ and by $b[i]$ the component $b_i$ of vector $b$. The Gauss method has the following form (where the *pivoting* discussed later is also included):

*Gauss-Method($A, b$)*

 1  {Forward phase:}
 2  `FOR` $k \leftarrow 1$ `TO` $n-1$
 3    `DO` {pivoting and interchange of rows and columns}
 4      `FOR` $i \leftarrow k+1$ `TO` $n$
 5        `DO` $l_{ik} \leftarrow A[i,k]/A[k,k]$
 6          `FOR` $j \leftarrow k+1$ `TO` $n$
 7            `DO` $A[i,j] \leftarrow A[i,j] - l_{ik}A[k,j]$
 8          $b[i] \leftarrow b[i] - l_{ik}b[k]$
 9  {Backward phase: see the Back-Substitution algorithm.}
10  `RETURN` $x$

The algorithm overwrites the original matrix $A$ and vector $b$. It does not, however, write the zero entries under the main diagonal, since these elements are not necessary for the second phase of the algorithm. Hence the lower triangular part of matrix $A$ can be used to store information for the $LU$-decomposition of matrix $A$.
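Both phases can be sketched compactly in Python (no pivoting yet, so every pivot is assumed nonzero; as in the text's algorithm, `A` and `b` are overwritten and the multipliers are stored below the diagonal):

```python
def gauss_solve(A, b):
    """Gaussian elimination (forward phase) + back substitution (backward phase)."""
    n = len(b)
    for k in range(n - 1):                 # eliminate below column k
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]          # multiplier l_ik
            A[i][k] = m                    # store it below the diagonal
            for j in range(k + 1, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):         # backward phase
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x
```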

The above version of the Gauss method can be performed only if the pivot elements $a_{kk}^{(k)}$ occurring in the computation are not zero. For this and numerical stability reasons we use the Gaussian elimination with *pivoting*.

If $a_{kk}^{(k)} = 0$, then we can interchange row $k$ with another row, say $i$ ($k < i \le n$), so that the new entry at position $(k,k)$ should be nonzero. If this is not possible, then all the coefficients $a_{ik}^{(k)}$ $(k \le i \le n)$ are zero and $\det(A) = 0$. In the latter case $Ax = b$ has no unique solution. The element $a_{kk}^{(k)}$ is called the
**pivot element**. We can always select new pivot elements by interchanging the rows. The selection of the pivot element has a great influence on the reliability of the computed results. The simple fact that we divide by the pivot element indicates this influence. We recall that $\det(A)$ equals, up to sign, the product of the pivot elements. It is considered advantageous if the pivot element is selected so that it has the greatest possible modulus. The process of selecting the pivot element is called *pivoting*. We mention the following two pivoting processes.

**Partial pivoting:** At the $k$-th step, interchange the rows of the matrix so that the largest remaining element, say $a_{ik}^{(k)}$, in the $k$-th column is used as pivot. After the pivoting we have

$$|a_{kk}^{(k)}| = \max_{k \le i \le n} |a_{ik}^{(k)}|.$$

**Complete pivoting:** At the $k$-th step, interchange both the rows and columns of the matrix so that the largest element, say $a_{ij}^{(k)}$, in the remaining matrix is used as pivot. After the pivoting we have

$$|a_{kk}^{(k)}| = \max_{k \le i, j \le n} |a_{ij}^{(k)}|.$$

Note that the interchange of two columns implies the interchange of the corresponding unknowns. The significance of pivoting is well illustrated by the following

**Example 12.6 **The exact solution of the linear system

is and . The MATLAB program gives the result , and this is the best available result in standard double precision arithmetic. Solving this system with the Gaussian elimination without pivoting (also in double precision) we obtain the catastrophic result and . Using partial pivoting with the Gaussian elimination we obtain the best available numerical result .
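The catastrophe described in Example 12.6 can be reproduced with a small experiment. The coefficients below are hypothetical stand-ins (the example's own numbers are not preserved in the text), chosen so that a tiny pivot forces catastrophic cancellation without pivoting:

```python
def solve2(a11, a12, b1, a21, a22, b2, pivot):
    """2-by-2 Gaussian elimination; with pivot=True the rows are swapped
    so that the larger first-column entry becomes the pivot."""
    if pivot and abs(a21) > abs(a11):
        a11, a12, b1, a21, a22, b2 = a21, a22, b2, a11, a12, b1
    m = a21 / a11            # multiplier
    a22 -= m * a12           # eliminate the first unknown from row 2
    b2 -= m * b1
    y = b2 / a22
    x = (b1 - a12 * y) / a11
    return x, y

# hypothetical ill-scaled system:
#   1e-17*x + y = 1,   x + y = 2,   exact solution very close to (1, 1)
x_bad, y_bad = solve2(1e-17, 1.0, 1.0, 1.0, 1.0, 2.0, pivot=False)
x_ok, y_ok = solve2(1e-17, 1.0, 1.0, 1.0, 1.0, 2.0, pivot=True)
# without pivoting x_bad collapses to 0.0; with partial pivoting x_ok is 1.0
```

The huge multiplier $10^{17}$ wipes out the information in the second equation; swapping the rows first keeps all multipliers bounded by $1$.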

**Remark 12.5 **
*Theoretically we do not need pivoting in the following cases: 1. If $A$ is symmetric and positive definite ($A \in \mathbb{R}^{n\times n}$ is positive definite, if $A = A^T$ and $x^TAx > 0$ for all $x \neq 0$). 2. If $A$ is diagonally dominant in the following sense:*

$$|a_{ii}| > \sum_{j \neq i} |a_{ij}| \qquad (1 \le i \le n).$$

*In case of symmetric and positive definite matrices we use the Cholesky method which is a special version of the Gauss-type methods.*

During the Gaussian elimination we obtain a sequence of equivalent linear systems

$$A^{(k)}x = b^{(k)} \qquad (k = 1, \ldots, n),$$

where

$$A^{(1)} = A, \qquad b^{(1)} = b, \qquad A^{(k)} = \big[a_{ij}^{(k)}\big].$$

Note that matrices $A^{(k)}$ are stored in the place of $A = A^{(1)}$. The last coefficient matrix of phase I has the form

$$A^{(n)} = \begin{bmatrix} a_{11}^{(1)} & a_{12}^{(1)} & \cdots & a_{1n}^{(1)} \\ 0 & a_{22}^{(2)} & \cdots & a_{2n}^{(2)} \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & a_{nn}^{(n)} \end{bmatrix},$$

where $a_{kk}^{(k)}$ is the $k$-th pivot element. The **growth factor of pivot elements** is given by

$$\rho = \rho_n = \frac{\max_{i,j,k} |a_{ij}^{(k)}|}{\max_{i,j} |a_{ij}|}.$$

Wilkinson proved that the error of the computed solution is proportional to the growth factor $\rho$ and the bounds

$$\rho \le \sqrt{n}\left(2 \cdot 3^{1/2} \cdot 4^{1/3} \cdots n^{1/(n-1)}\right)^{1/2}$$

and

$$\rho \le 2^{n-1}$$

hold for complete and partial pivoting, respectively. Wilkinson conjectured that $\rho \le n$ for complete pivoting. This has been proved by researchers for small values of $n$. Statistical investigations on random matrices indicate that the average of $\rho$ is $\Theta(n^{2/3})$ for the partial pivoting and $\Theta(n^{1/2})$ for the complete pivoting. Hence the worst case $\rho = 2^{n-1}$ hardly occurs in the statistical sense.

We remark that Wilkinson constructed a linear system on which $\rho = 2^{n-1}$ for the partial pivoting. Hence Wilkinson's bound for $\rho$ is sharp in the case of partial pivoting. There also exist examples of linear systems arising from discretisations of differential and integral equations, where $\rho$ is increasing exponentially if Gaussian elimination is used with partial pivoting.

The growth factor can be very large, if the Gaussian elimination is used without pivoting. For example, , if

**Operations counts.** The Gauss method gives the solution of the linear system $Ax = b$ ($A \in \mathbb{R}^{n\times n}$) in a finite number of steps and arithmetic operations. The amount of necessary arithmetic operations is an important characteristic of the direct linear system solvers, since the CPU time is largely proportional to the number of arithmetic operations. It was also observed that the numbers of additive and multiplicative operations are nearly the same in the numerical algorithms of linear algebra. For measuring the cost of such algorithms C. B. Moler introduced the concept of flop.

**Definition 12.6 **
*One (old) flop is the computational work necessary for the operation $s \leftarrow s + ab$ (1 addition + 1 multiplication). One (new) flop is the computational work necessary for any of the arithmetic operations $+, -, *, /$.*

The new flop can be used if the computational times of additive and multiplicative operations are approximately the same. Two new flops equal one old flop. Here we use the notion of old flop.

For the Gauss method a simple counting gives the number of additive and multiplicative operations.

**Theorem 12.7 **
*The computational cost of the Gauss method is $n^3/3 + \Theta(n^2)$ flops.*

V. V. Klyuyev and N. Kokovkin-Shcherbak proved that if only elementary row and column operations (multiplication of a row or column by a number, interchange of rows or columns, addition of a multiple of a row or column to another row or column) are allowed, then the linear system $Ax = b$ cannot be solved in fewer than $n^3/3 + \Theta(n^2)$ flops.

Using fast matrix inversion procedures we can solve the linear system $Ax = b$ in $O(n^{2.376})$ flops. These theoretically interesting algorithms are not used in practice since they are considered numerically unstable.

**The $LU$-decomposition.** In many cases it is easier to solve a linear system if the coefficient matrix can be decomposed into the product of two triangular matrices.

**Definition 12.8 **
*The matrix $A \in \mathbb{R}^{n\times n}$ has an
$LU$-decomposition, if $A = LU$, where $L$ is a lower and $U$ is an upper triangular matrix.*

The $LU$-decomposition is not unique. If a nonsingular matrix has an $LU$-decomposition, then it has a particular $LU$-decomposition, where the main diagonal of a given component matrix consists of $1$'s. Such triangular matrices are called **unit (upper or lower) triangular** matrices. The decomposition is unique, if $L$ is set to be unit lower triangular or $U$ is set to be unit upper triangular.

The $LU$-decomposition of nonsingular matrices is closely related to the Gaussian elimination method. If $A = LU$, where $L$ is unit lower triangular, then $l_{ik} = a_{ik}^{(k)}/a_{kk}^{(k)}$ $(i > k)$, where the entries $a_{ij}^{(k)}$ are given by the Gauss algorithm. The matrix $U$ is the upper triangular part of the matrix we obtain at the end of the forward phase. The matrix $L$ can also be derived from this matrix, if the columns of the lower triangular part are divided by the corresponding main diagonal elements. We remind that the first phase of the Gaussian elimination does not annihilate the matrix elements under the main diagonal. It is clear that a nonsingular matrix $A$ has an $LU$-decomposition if and only if $a_{kk}^{(k)} \neq 0$ holds for each pivot element of the Gauss method without pivoting.
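The relation between elimination and the factorisation can be made explicit in code: a sketch of the Doolittle form ($L$ unit lower triangular), assuming all pivots are nonzero:

```python
def lu_decompose(A):
    """Return (L, U) with A = L U, L unit lower triangular (no pivoting)."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]                  # work on a copy of A
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]        # the multiplier is l_ik
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]   # eliminate row i
    return L, U
```

The multipliers produced by the forward phase are exactly the subdiagonal entries of $L$, while the eliminated matrix is $U$.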

**Definition 12.9 **
*A matrix $P \in \mathbb{R}^{n\times n}$ whose every row and column has one and only one non-zero element, that element being $1$, is called a permutation matrix.*

In case of partial pivoting we permute the rows of the coefficient matrix (multiply $A$ by a permutation matrix $P$ on the left) so that $a_{kk}^{(k)} \neq 0$ holds for every pivot element of a nonsingular matrix. Hence we have

**Theorem 12.10 **
*If $A$ is nonsingular then there exists a permutation matrix $P$ such that $PA$ has an $LU$-decomposition.*

The algorithm of the $LU$-decomposition is essentially the Gaussian elimination method. If pivoting is used, then the interchange of rows must also be executed on the elements under the main diagonal and the permutation matrix $P$ must be recorded. A vector containing the actual order of the original matrix rows is obviously sufficient for this purpose.

**The $LU$- and Cholesky-methods.** Let $A = LU$ and consider the equation $Ax = b$. Since $Ax = LUx = b$, we can decompose it into the equivalent pair of linear systems $Ly = b$ and $Ux = y$, where $L$ is lower triangular and $U$ is upper triangular.

*LU-Method($A, b$)*

 1  Determine the $LU$-decomposition $A = LU$.
 2  Solve $Ly = b$.
 3  Solve $Ux = y$.
 4  `RETURN` $x$

*Remark.* In case of partial pivoting we obtain the decomposition $PA = LU$ and we solve $Ly = Pb$ instead of $Ly = b$.

In the first phase of the Gauss method we produce the decomposition $A = LU$ and the equivalent linear system $Ux = L^{-1}b$ with upper triangular coefficient matrix. The latter is solved in the second phase. In the $LU$-method we decompose the first phase of the Gauss method into two steps. In the first step we obtain only the decomposition $A = LU$. In the second step we produce the vector $y = L^{-1}b$. The third step of the algorithm is identical with the second phase of the original Gauss method.

The $LU$-method is especially advantageous if we have to solve several linear systems with the same coefficient matrix:

$$Ax = b_1, \quad Ax = b_2, \quad \ldots, \quad Ax = b_k.$$

In such a case we determine the $LU$-decomposition of matrix $A$ only once, and then we solve the linear systems $Ly_i = b_i$, $Ux_i = y_i$ ($x_i, y_i, b_i \in \mathbb{R}^n$, $i = 1, \ldots, k$). The computational cost of this process is $n^3/3 + kn^2 + \Theta(kn)$ flops.
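Reusing the factors amounts to two triangular solves per right-hand side; a sketch (assuming $L$ is unit lower triangular, as produced by the Doolittle form):

```python
def lu_solve(L, U, b):
    """Solve L y = b, then U x = y, reusing a precomputed LU factorisation."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                          # forward substitution
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):              # back substitution
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x
```

Each call costs only $\Theta(n^2)$ flops, so the $n^3/3$ cost of the factorisation is paid once for all $k$ systems.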

The *inversion of a matrix* $A \in \mathbb{R}^{n\times n}$
can be done as follows:

1. Determine the $LU$-decomposition $A = LU$.

2. Solve $Ly_i = e_i$, $Ux_i = y_i$ ($e_i$ is the $i$-th unit vector, $i = 1, \ldots, n$).

The inverse of $A$ is given by $A^{-1} = [x_1, \ldots, x_n]$. The computational cost of the algorithm is $n^3 + \Theta(n^2)$ flops.

This implementation of the $LU$-method has been known since the 60's. Vector $P$ contains the indices of the rows. At the start we set $P[i] = i$ $(1 \le i \le n)$. When exchanging rows we exchange only the corresponding components of vector $P$.

*LU-Method-with-Pointers($A, b$)*

 1  $P \leftarrow [1, 2, \ldots, n]$
 2  `FOR` $k \leftarrow 1$ `TO` $n-1$
 3    `DO` compute the index $i$ ($k \le i \le n$) such that $|A[P[i],k]| = \max_{k \le j \le n} |A[P[j],k]|$
 4      `IF` $i > k$
 5        `THEN` exchange the components $P[i]$ and $P[k]$
 6      `FOR` $i \leftarrow k+1$ `TO` $n$
 7        `DO` $A[P[i],k] \leftarrow A[P[i],k]/A[P[k],k]$
 8          `FOR` $j \leftarrow k+1$ `TO` $n$
 9            `DO` $A[P[i],j] \leftarrow A[P[i],j] - A[P[i],k]\,A[P[k],j]$
10  {Solve $Ly = Pb$:}
11  `FOR` $i \leftarrow 1$ `TO` $n$
12    `DO` $y_i \leftarrow b[P[i]] - \sum_{j=1}^{i-1} A[P[i],j]\,y_j$
13  {Solve $Ux = y$:}
14  `FOR` $i \leftarrow n$ `DOWNTO` $1$
15    `DO` $x_i \leftarrow \big(y_i - \sum_{j=i+1}^{n} A[P[i],j]\,x_j\big)/A[P[i],i]$
16  `RETURN` $x$

If $A$ is *symmetric and positive definite*, then it can be decomposed in the form $A = LL^T$, where $L$ is a lower triangular matrix. The $LL^T$-decomposition is called the **Cholesky-decomposition**. In this case we can save approximately half of the storage place for $A$ and half of the computational cost of the $LU$-decomposition ($LL^T$-decomposition). Let

$$A = LL^T = \begin{bmatrix} l_{11} & & \\ \vdots & \ddots & \\ l_{n1} & \cdots & l_{nn} \end{bmatrix}\begin{bmatrix} l_{11} & \cdots & l_{n1} \\ & \ddots & \vdots \\ & & l_{nn} \end{bmatrix}.$$

Observing that only the first $k$ elements may be nonzero in the $k$-th column of $L^T$ we obtain that

$$a_{ik} = \sum_{j=1}^{k} l_{ij}l_{kj} \qquad (i \ge k).$$

This gives the formulae

$$l_{kk} = \Big(a_{kk} - \sum_{j=1}^{k-1} l_{kj}^2\Big)^{1/2}, \qquad l_{ik} = \Big(a_{ik} - \sum_{j=1}^{k-1} l_{ij}l_{kj}\Big)\Big/l_{kk} \quad (i = k+1, \ldots, n).$$

Using the notation $A[i,j] = a_{ij}$ $(1 \le i, j \le n)$ we can formulate the Cholesky-method as follows.

*Cholesky-Method($A$)*

 1  `FOR` $k \leftarrow 1$ `TO` $n$
 2    `DO` $A[k,k] \leftarrow \big(A[k,k] - \sum_{j=1}^{k-1} A[k,j]^2\big)^{1/2}$
 3      `FOR` $i \leftarrow k+1$ `TO` $n$
 4        `DO` $A[i,k] \leftarrow \big(A[i,k] - \sum_{j=1}^{k-1} A[i,j]A[k,j]\big)/A[k,k]$
 5  `RETURN` $A$

The lower triangular part of $A$ contains $L$. The computational cost of the algorithm is $n^3/6 + \Theta(n^2)$ flops and $n$ square roots. The algorithm, which can be considered as a special case of the Gauss-methods, does not require pivoting, at least in principle.
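The formulae above translate directly to Python (a minimal sketch with 0-based indexing; `A` must be symmetric positive definite, otherwise the square root fails):

```python
import math

def cholesky(A):
    """Return lower triangular L with A = L L^T (no pivoting)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        s = A[k][k] - sum(L[k][j] ** 2 for j in range(k))
        L[k][k] = math.sqrt(s)                 # diagonal entry l_kk
        for i in range(k + 1, n):
            t = A[i][k] - sum(L[i][j] * L[k][j] for j in range(k))
            L[i][k] = t / L[k][k]              # subdiagonal entry l_ik
    return L
```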

It often happens that linear systems have *banded coefficient matrices*.

**Definition 12.11 **
*Matrix $A \in \mathbb{R}^{n\times n}$ is banded with lower bandwidth $p$ and upper bandwidth $q$ if*

$$a_{ij} = 0, \quad \text{if } i > j + p \text{ or } j > i + q.$$

The possibly non-zero elements $a_{ij}$ $(i - p \le j \le i + q)$ form a band-like structure. Schematically $A$ has the form

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1,1+q} & & \\ \vdots & \ddots & & \ddots & \\ a_{1+p,1} & & \ddots & & a_{n-q,n} \\ & \ddots & & \ddots & \vdots \\ & & a_{n,n-p} & \cdots & a_{nn} \end{bmatrix}.$$

The *banded matrices* yield very efficient algorithms if $p$ and $q$ are significantly less than $n$. If a banded matrix $A$ with lower bandwidth $p$ and upper bandwidth $q$ has an $LU$-decomposition, then $L$ and $U$ are banded with lower bandwidth $p$ and upper bandwidth $q$, respectively.

Next we give the -method for banded matrices in three parts.

*The-LU-Decomposition-of-Banded-Matrix($A, n, p, q$)*

 1  `FOR` $k \leftarrow 1$ `TO` $n-1$
 2    `DO FOR` $i \leftarrow k+1$ `TO` $\min(k+p,\, n)$
 3      `DO` $A[i,k] \leftarrow A[i,k]/A[k,k]$
 4        `FOR` $j \leftarrow k+1$ `TO` $\min(k+q,\, n)$
 5          `DO` $A[i,j] \leftarrow A[i,j] - A[i,k]A[k,j]$
 6  `RETURN` $A$

Entry $a_{ij}$ is overwritten by $l_{ij}$, if $i > j$, and by $u_{ij}$, if $i \le j$. The computational cost of the algorithm is $c(p,q)$ flops, where $c(p,q) \approx npq$ whenever $p, q \ll n$.

The following algorithm overwrites $b$ by the solution $y$ of the equation $Ly = b$, where $L$ is a banded unit lower triangular matrix.

*Solution-of-Banded-Unit-Lower-Triangular-System($L, b, n, p$)*

 1  `FOR` $i \leftarrow 1$ `TO` $n$
 2    `DO` $b_i \leftarrow b_i - \sum_{j=\max(1,\, i-p)}^{i-1} l_{ij}b_j$
 3  `RETURN` $b$

The total cost of the algorithm is approximately $np$ flops. The next algorithm overwrites vector $b$ by the solution of $Ux = b$.

*Solution-of-Banded-Upper-Triangular-System($U, b, n, q$)*

 1  `FOR` $i \leftarrow n$ `DOWNTO` $1$
 2    `DO` $b_i \leftarrow \big(b_i - \sum_{j=i+1}^{\min(i+q,\, n)} u_{ij}b_j\big)/u_{ii}$
 3  `RETURN` $b$

The computational cost is approximately $n(q+1)$ flops.
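For the common tridiagonal case (lower and upper bandwidth both $1$) the banded decomposition and the two banded triangular solves above collapse into the classical Thomas algorithm; a Python sketch (the function name and the three-diagonal calling convention are ours, not the book's):

```python
def tridiag_solve(a, d, c, b):
    """Solve a tridiagonal system with lower diagonal a (length n-1),
    main diagonal d (length n), upper diagonal c (length n-1),
    right-hand side b; banded LU fused with both solves, O(n) flops."""
    n = len(d)
    d = d[:]
    b = b[:]
    for i in range(1, n):                 # banded elimination, p = q = 1
        m = a[i - 1] / d[i - 1]
        d[i] -= m * c[i - 1]
        b[i] -= m * b[i - 1]
    x = [0.0] * n
    x[n - 1] = b[n - 1] / d[n - 1]
    for i in range(n - 2, -1, -1):        # banded back substitution
        x[i] = (b[i] - c[i] * x[i + 1]) / d[i]
    return x
```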

Assume that $A$ is symmetric, positive definite and banded with lower bandwidth $p$. The banded version of the Cholesky-method is given by

*Cholesky-Decomposition-of-Banded-Matrices($A, n, p$)*

 1  `FOR` $i \leftarrow 1$ `TO` $n$
 2    `DO FOR` $j \leftarrow \max(1,\, i-p)$ `TO` $i-1$
 3      `DO` $A[i,j] \leftarrow \big(A[i,j] - \sum_{k=\max(1,\, i-p)}^{j-1} A[i,k]A[j,k]\big)/A[j,j]$
 4      $A[i,i] \leftarrow \big(A[i,i] - \sum_{k=\max(1,\, i-p)}^{i-1} A[i,k]^2\big)^{1/2}$
 5  `RETURN` $A$

The elements $a_{ij}$ are overwritten by $l_{ij}$ $(i \ge j)$. The total amount of work is given by approximately $np^2/2$ flops and $n$ square roots.

*Remark.* If $A$ has lower bandwidth $p$ and upper bandwidth $q$ and partial pivoting takes place, then the upper bandwidth of $U$ increases up to $p + q$.

There are several iterative methods for solving linear systems of algebraic equations. The best known iterative algorithms are the classical *Jacobi*, *Gauss-Seidel* and *relaxation methods*. The greatest advantage of these iterative algorithms is that they are easy to implement for large systems. At the same time they usually have slow convergence. However, for parallel computers the *multisplitting* iterative algorithms seem to be efficient.

Consider the iteration

$$x^{(k+1)} = Bx^{(k)} + c \qquad (k = 0, 1, \ldots),$$

where $B \in \mathbb{R}^{n\times n}$ and $x^{(0)}, c \in \mathbb{R}^n$. It is known that $\{x^{(k)}\}$ converges for all $x^{(0)}$ and $c$, if and only if the spectral radius $\rho(B) = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } B\}$ satisfies $\rho(B) < 1$. In case of convergence $x^{(k)} \to x^* = (I - B)^{-1}c$, that is we obtain the solution of the equation $x = Bx + c$. The speed of convergence depends on the spectral radius $\rho(B)$: the smaller the spectral radius, the faster the convergence.
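The classical Jacobi method mentioned above is an iteration of exactly this form, with $B = -D^{-1}(A - D)$ and $c = D^{-1}b$, where $D$ is the diagonal of $A$; for a diagonally dominant matrix $\rho(B) < 1$ and the iteration converges. A minimal sketch:

```python
def jacobi(A, b, iters=50):
    """Jacobi iteration x <- D^{-1}(b - (A - D)x); a sketch assuming
    A has nonzero diagonal and the iteration matrix has spectral
    radius below 1 (e.g. A diagonally dominant)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x
```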

Consider now the linear system

$$Ax = b,$$

where $A \in \mathbb{R}^{n\times n}$ is nonsingular. The matrices $M_l, N_l, E_l \in \mathbb{R}^{n\times n}$ $(l = 1, \ldots, L)$ form a multisplitting of $A$ if

(i) $A = M_l - N_l$, $l = 1, \ldots, L$,

(ii) $M_l$ is nonsingular, $l = 1, \ldots, L$,

(iii) $E_l$ is a non-negative diagonal matrix, $l = 1, \ldots, L$,

(iv) $\sum_{l=1}^{L} E_l = I$.

Let $x_0$ be a given initial vector. The multisplitting iterative method is the following.

*Multisplitting-Iteration($x_0, b, M_l, N_l, E_l$, $l = 1, \ldots, L$)*

 1  $i \leftarrow 0$
 2  `WHILE` exit condition = `false`
 3    `DO` $i \leftarrow i + 1$
 4      `FOR` $l \leftarrow 1$ `TO` $L$
 5        `DO` solve $M_l y_l = N_l x_{i-1} + b$
 6      $x_i \leftarrow \sum_{l=1}^{L} E_l y_l$
 7  `RETURN` $x_i$

It is easy to show that $x_i = Hx_{i-1} + c$, where

$$H = \sum_{l=1}^{L} E_l M_l^{-1} N_l, \qquad c = \sum_{l=1}^{L} E_l M_l^{-1} b.$$

Thus the condition of convergence is $\rho(H) < 1$. The *multisplitting iteration* is a true *parallel algorithm* because we can solve the $L$ linear systems $M_l y_l = N_l x_{i-1} + b$ in parallel in each iteration (synchronised parallelism). The bottleneck of the algorithm is the computation of the iterate $x_i$.

The selection of matrices $M_l$ and $E_l$ is such that the solution of the linear systems $M_l y = z$ should be cheap. Let $S_1, \ldots, S_L$ be a partition of $\{1, \ldots, n\}$, that is $S_l \neq \emptyset$, $S_l \cap S_k = \emptyset$ ($l \neq k$) and $\cup_{l=1}^{L} S_l = \{1, \ldots, n\}$. Furthermore let $\tilde S_l \supseteq S_l$ $(l = 1, \ldots, L)$ be sets such that each index belongs to $\tilde S_l$ for at least one $l$.

The **non-overlapping block Jacobi splitting** $A = M_l - N_l$ of $A$ is given by

$$M_l = \big[m_{ij}^{(l)}\big], \quad m_{ij}^{(l)} = \begin{cases} a_{ij}, & i, j \in S_l,\\ a_{ii}, & i = j,\\ 0, & \text{otherwise}, \end{cases} \qquad N_l = M_l - A, \qquad E_l = \mathrm{diag}\big(e_i^{(l)}\big), \quad e_i^{(l)} = \begin{cases} 1, & i \in S_l,\\ 0, & \text{otherwise}, \end{cases}$$

for $l = 1, \ldots, L$.

Define now the simple splitting

$$A = M - N,$$

where $M$ is nonsingular,

$$M = [m_{ij}], \qquad m_{ij} = \begin{cases} a_{ij}, & \text{if } i, j \in S_l \text{ for some } l,\\ 0, & \text{otherwise}. \end{cases}$$

It can be shown that

$$\rho\Big(\sum_{l=1}^{L} E_l M_l^{-1} N_l\Big) = \rho\big(M^{-1}N\big)$$

holds for the non-overlapping block Jacobi multisplitting.

The **overlapping block Jacobi multisplitting** $A = \tilde M_l - \tilde N_l$ of $A$ is defined analogously, with the overlapping sets $\tilde S_l$ taking the place of the sets $S_l$, for $l = 1, \ldots, L$.

A nonsingular matrix $A$ is called an $M$-matrix, if $a_{ij} \le 0$ $(i \neq j)$ and all the elements of $A^{-1}$ are nonnegative.

**Theorem 12.12 **
*Assume that $A$ is a nonsingular $M$-matrix, $\{M_l, N_l, E_l\}_{l=1}^{L}$ is a non-overlapping, $\{\tilde M_l, \tilde N_l, E_l\}_{l=1}^{L}$ is an overlapping block Jacobi multisplitting of $A$, where the weighting matrices $E_l$ are the same. Then we have*

$$\rho(\tilde H) \le \rho(H) < 1,$$

*where $H$ and $\tilde H$ are the iteration matrices of the non-overlapping and the overlapping multisplitting, respectively.*

We can observe that both iteration procedures are convergent and the convergence of the overlapping multisplitting is not slower than that of the non-overlapping procedure. The theorem remains true if we use block Gauss-Seidel multisplittings instead of the block Jacobi multisplittings. In this case we replace the above defined matrices $M_l$ and $\tilde M_l$ with their lower triangular parts.

The multisplitting algorithm has multi-stage and asynchronous variants as well.

We analyse the direct and inverse errors. We use the following notations and concepts. The exact (theoretical) solution of $Ax = b$ is denoted by $x$, while any approximate solution is denoted by $\hat x$. The direct error of the approximate solution $\hat x$ is given by $\Delta x = \hat x - x$. The quantity $r = r(\hat x) = b - A\hat x$ is called the **residual error**. For the exact solution $r(x) = 0$, while for the approximate solution

$$r(\hat x) = b - A\hat x = A(x - \hat x) = -A\Delta x.$$

We use various models to estimate the inverse error. In the most general case we assume that the computed solution $\hat x$ satisfies the linear system $\hat A\hat x = \hat b$, where $\hat A = A + \Delta A$ and $\hat b = b + \Delta b$. The quantities $\Delta A$ and $\Delta b$ are called the inverse errors.

One has to distinguish between the sensitivity of the problem and the stability of the solution algorithm. By **sensitivity of a problem** we mean the sensitivity of the solution to changes in the input parameters (data). By the **stability (or sensitivity) of an algorithm** we mean the influence of computational errors on the computed solution. We measure the sensitivity of a problem or algorithm in various ways. One such characterization is the “condition number”, which compares the relative errors of the input and output values.

The following general principles are used when applying any algorithm:

- We use only stable or well-conditioned algorithms.

- We cannot solve an unstable (ill-posed or ill-conditioned) problem with a general purpose algorithm, in general.

Assume that we solve the perturbed equation

$$A\hat x = b + \Delta b$$

instead of the original $Ax = b$. Let $\Delta x = \hat x - x$ and investigate the difference of the two solutions.

**Theorem 12.13 **
*If $A$ is nonsingular and $b \neq 0$, then*

$$\frac{\|\Delta x\|}{\|x\|} \le \mathrm{cond}(A)\frac{\|\Delta b\|}{\|b\|},$$

*where $\mathrm{cond}(A) = \|A\|\|A^{-1}\|$ is the condition number of $A$.*

Here we can see that the condition number of $A$ may strongly influence the relative error of the perturbed solution $\hat x$. A linear algebraic system is said to be *well-conditioned* if $\mathrm{cond}(A)$ is small, and *ill-conditioned*, if $\mathrm{cond}(A)$ is big. It is clear that the terms "small" and "big" are relative and the condition number depends on the norm chosen. We identify the applied norm if it is essential for some reason. For example, $\mathrm{cond}_\infty(A) = \|A\|_\infty\|A^{-1}\|_\infty$. The next example gives a possible geometric characterization of the condition number.

**Example 12.7 **The linear system

is ill-conditioned (its condition number is large). The two lines, whose intersection point defines the solution of the system, are almost parallel. Therefore if we perturb the right-hand side, the new intersection point of the two lines will be far from the previous one.
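The geometric effect of Example 12.7 can be reproduced numerically; a sketch with hypothetical coefficients (the example's own numbers are not preserved in the text), solving two nearly parallel lines by Cramer's rule:

```python
def solve2x2(a, b, c, d, r1, r2):
    """Solve [[a, b], [c, d]] x = [r1, r2] by Cramer's rule."""
    det = a * d - b * c
    return (r1 * d - r2 * b) / det, (a * r2 - c * r1) / det

# two nearly parallel lines (hypothetical data):
#   x + y = 2,  x + 1.0001*y = 2.0001   ->  solution (1, 1)
x1, y1 = solve2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0001)
# perturb the right-hand side in the fourth decimal place:
x2, y2 = solve2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0002)
# the intersection point jumps from (1, 1) to (0, 2)
```

A change of $10^{-4}$ in the data moves the solution by a whole unit, exactly the magnification that a large condition number predicts.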

The inverse error is $\Delta b$ in the sensitivity model under investigation. Theorem 12.13 gives an estimate of the direct error which conforms with the thumb rule. It follows that we can expect a small relative error of the perturbed solution $\hat x$, if the condition number of $A$ is small.

**Example 12.8 **Consider the linear system with

Let . Then and , but .

Consider now the perturbed linear system

$$(A + \Delta A)\hat x = b$$

instead of $Ax = b$. It can be proved that for this perturbation model there exist more than one *inverse error* $\Delta A$, among which

$$\Delta A = \frac{r\hat x^T}{\hat x^T\hat x} \qquad (r = b - A\hat x)$$

is the inverse error with minimal spectral norm, provided that $\hat x \neq 0$.

The following theorem establishes that for a small relative residual error the relative inverse error is also small.

**Theorem 12.14 **
*Assume that $\hat x$ is the approximate solution of $Ax = b$, $\det(A) \neq 0$ and $r = b - A\hat x$. If $\|r\|_2 \le \varepsilon\|A\|_2\|\hat x\|_2$, then the matrix $\Delta A = r\hat x^T/\|\hat x\|_2^2$ satisfies $(A + \Delta A)\hat x = b$ and $\|\Delta A\|_2 \le \varepsilon\|A\|_2$.*

If the relative inverse error and the condition number of are small, then the relative residual error is small.

**Theorem 12.15 **
*If , , , and , then*

If is ill-conditioned, then Theorem 12.15 is not true.

**Example 12.9 **Let , and , (). Then and . Let

Then and , which is not small.

In the most general case we solve the perturbed equation

$$(A + \Delta A)\hat x = b + \Delta b$$

instead of $Ax = b$. The following general result holds.

**Theorem 12.16 **
*If $A$ is nonsingular, $\|A^{-1}\|\|\Delta A\| < 1$, and $(A + \Delta A)\hat x = b + \Delta b$, then*

$$\frac{\|\hat x - x\|}{\|x\|} \le \frac{\mathrm{cond}(A)}{1 - \mathrm{cond}(A)\frac{\|\Delta A\|}{\|A\|}}\left(\frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta b\|}{\|b\|}\right).$$
This theorem implies the following “thumb rule”.

**Thumb rule.**
*Assume that we solve $Ax = b$ numerically. If the entries of $A$ and $b$ are accurate to about $s$ decimal places and $\mathrm{cond}(A) \approx 10^t$, where $t < s$, then the entries of the computed solution are accurate to about $s - t$ decimal places.*

The assumption of Theorem 12.16 guarantees that the matrix $A + \Delta A$ is nonsingular. The inequality $\|A^{-1}\|\|\Delta A\| < 1$ is equivalent with the inequality $\|\Delta A\| < 1/\|A^{-1}\|$, and the distance of $A$ from the nearest singular matrix is just $1/\|A^{-1}\|$. Thus we can give a new characterization of the condition number:

$$\frac{1}{\mathrm{cond}(A)} = \min\left\{\frac{\|\Delta A\|}{\|A\|} : A + \Delta A \text{ is singular}\right\}.$$

Thus if a matrix is ill-conditioned, then it is close to a singular matrix. Earlier we defined the condition numbers of matrices as the condition number of the mapping $x \mapsto Ax$.

Let us introduce the following definition.

**Definition 12.17 **
*A linear system solver is said to be weakly stable on a matrix class $\mathcal{H}$, if for all well-conditioned $A \in \mathcal{H}$ and for all $b$, the computed solution $\hat x$ of the linear system $Ax = b$ has small relative error $\|\hat x - x\|/\|x\|$.*

Putting together Theorems 12.13–12.16 we obtain the following.

**Theorem 12.18 **
**(Bunch)**
*A linear system solver is weakly stable on a matrix class $\mathcal{H}$, if for all well-conditioned $A \in \mathcal{H}$ and for all $b$, the computed solution $\hat x$ of the linear system $Ax = b$ satisfies any of the following conditions: *

*(1) $\|\hat x - x\|/\|x\|$ is small; *

*(2) $\|r(\hat x)\|/(\|A\|\|\hat x\|)$ is small; *

*(3) There exists $\Delta A$ such that $(A + \Delta A)\hat x = b$ and $\|\Delta A\|/\|A\|$ is small.*

The estimate of Theorem 12.16 can be used in practice if we know estimates of $\|\Delta A\|$ and $\|\Delta b\|$. If no estimates are available, then we can only make a posteriori error estimates.

In the following we study the componentwise error estimates. We first give an estimate for the absolute error of the approximate solution using the components of the inverse error.

**Theorem 12.19 **
**(Bauer, Skeel)**
*Let $A$ be nonsingular and assume that the approximate solution $\hat x$ of $Ax = b$ satisfies the linear system $(A + \Delta A)\hat x = b + \Delta b$. If $\varepsilon > 0$, $E \ge 0$ and $f \ge 0$ are such that $|\Delta A| \le \varepsilon E$, $|\Delta b| \le \varepsilon f$ and $\varepsilon\,\big\||A^{-1}|E\big\|_\infty < 1$, then*

$$\|\hat x - x\|_\infty \le \frac{\varepsilon\,\big\||A^{-1}|\left(E|x| + f\right)\big\|_\infty}{1 - \varepsilon\,\big\||A^{-1}|E\big\|_\infty}.$$

If $E = |A|$, $f = |b|$, and

$$\mathrm{cond}_r(A) = \big\|\,|A^{-1}||A|\,\big\|_\infty,$$

then we obtain the estimate

$$\|\hat x - x\|_\infty \le \frac{\varepsilon\,\big\||A^{-1}|\left(|A||x| + |b|\right)\big\|_\infty}{1 - \varepsilon\,\mathrm{cond}_r(A)}.$$

The quantity $\mathrm{cond}_r(A)$ is said to be the **Skeel-norm**, although it is not a norm in the earlier defined sense. The Skeel-norm satisfies the inequality

$$\mathrm{cond}_r(A) \le \mathrm{cond}_\infty(A) = \|A\|_\infty\|A^{-1}\|_\infty.$$

Therefore the above estimate is not worse than the traditional one that uses the standard condition number.

The inverse error can be estimated componentwise by the following result of Oettli and Prager. Let $E \in \mathbb{R}^{n\times n}$ and $f \in \mathbb{R}^n$. Assume that $E \ge 0$ and $f \ge 0$. Furthermore let

$$\omega = \max_i \frac{|r_i|}{(E|\hat x| + f)_i}, \qquad r = b - A\hat x.$$

**Theorem 12.20 **
**(Oettli, Prager)**
*The computed solution $\hat x$ satisfies a perturbed equation $(A + \Delta A)\hat x = b + \Delta b$ with $|\Delta A| \le \omega E$ and $|\Delta b| \le \omega f$, if $\omega$ is finite.*

We do not need the condition number to apply this theorem. In practice, with the choice $E = |A|$ and $f = |b|$, the computed values of $\omega$ are proportional to the machine epsilon.
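The quantity $\omega$ is cheap to evaluate, which makes the theorem attractive as an a posteriori test; a Python sketch with the common choice $E = |A|$, $f = |b|$ (the function name is ours):

```python
def oettli_prager(A, b, x):
    """Componentwise inverse-error bound of Oettli and Prager
    with E = |A| and f = |b|; costs Theta(n^2) flops."""
    n = len(b)
    r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = [sum(abs(A[i][j]) * abs(x[j]) for j in range(n)) + abs(b[i])
             for i in range(n)]
    return max(abs(r[i]) / denom[i] for i in range(n))
```

The exact solution gives $\omega = 0$; a slightly perturbed solution gives a correspondingly small $\omega$.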

**Theorem 12.21 **
**(Wilkinson)**
*The approximate solution $\hat x$ of $Ax = b$ obtained by the Gauss method in floating point arithmetic satisfies the perturbed linear equation*

$$(A + \Delta A)\hat x = b$$

*with*

$$\|\Delta A\|_\infty \le 8n^3\rho_n\|A\|_\infty u + O(u^2),$$

*where $\rho_n$ denotes the growth factor of the pivot elements and $u$ is the unit roundoff.*

Since $8n^3\rho_n u$ is small in practice, the relative inverse error

$$\frac{\|\Delta A\|_\infty}{\|A\|_\infty} \le 8n^3\rho_n u + O(u^2)$$

is also small. Therefore Theorem 12.18 implies that the Gauss method is weakly stable both for full and partial pivoting.

Wilkinson's theorem implies that

$$\mathrm{cond}_\infty(A)\,\frac{\|\Delta A\|_\infty}{\|A\|_\infty} \le 8n^3\rho_n\,\mathrm{cond}_\infty(A)\,u + O(u^2).$$

For a small condition number we can assume that $1 - \mathrm{cond}_\infty(A)\|\Delta A\|_\infty/\|A\|_\infty \approx 1$. Using Theorems 12.21 and 12.16 (case $\Delta b = 0$) we obtain the following estimate of the direct error:

$$\frac{\|\hat x - x\|_\infty}{\|x\|_\infty} \lesssim 8n^3\rho_n\,\mathrm{cond}_\infty(A)\,u.$$
The obtained result supports the thumb rule in the case of the Gauss method.

**Example 12.10 **Consider the following linear system whose coefficients can be represented exactly:

Here is big, but is negligible. The exact solution of the problem is , . MATLAB gives the approximate solution , with the relative error

Since and , the result essentially corresponds to the Wilkinson theorem or the thumb rule. The Wilkinson theorem gives the bound

for the inverse error. If we use the Oettli-Prager theorem with the choice $E = |A|$ and $f = |b|$, then we obtain an estimate $\omega$ of the inverse error that is better than that of Wilkinson.

**Scaling and preconditioning.** Several matrices that occur in applications are ill-conditioned if their order $n$ is large. For example the famous Hilbert-matrix

$$H_n = \left[\frac{1}{i + j - 1}\right]_{i,j=1}^{n}$$

has a condition number that grows exponentially with $n$. There exist matrices with integer entries that can be represented exactly in standard IEEE 754 floating point arithmetic while their condition number is extremely large.

We have two main techniques to solve linear systems with large condition numbers. Either we use multiple precision arithmetic or decrease the condition number. There are two known forms of decreasing the condition number.

**1. Scaling.** We replace the linear system $Ax = b$ with the equation

$$(RAC)y = Rb,$$

where $R$ and $C$ are diagonal matrices.

We apply the Gauss method to this scaled system and get the solution $y$. The quantity $x = Cy$ defines the requested solution. If the condition number of the matrix $RAC$ is smaller than that of $A$, then we expect a smaller error in $y$ and consequently in $x$. Various strategies are given to choose the scaling matrices $R$ and $C$. One of the best known strategies is **balancing**, which forces every column and row of $RAC$ to have approximately the same norm. For example, if

$$R = \mathrm{diag}\!\left(\frac{1}{\|a_1\|_2}, \ldots, \frac{1}{\|a_n\|_2}\right),$$

where $a_i$ is the $i$-th row vector of $A$, the Euclidean norms of the rows of $RA$ will be $1$ and the estimate

$$\mathrm{cond}_2(RA) \le \sqrt{n}\,\min_{D \in \mathcal{D}} \mathrm{cond}_2(DA)$$

holds with $\mathcal{D}$ denoting the set of nonsingular diagonal matrices. This means that $R$ optimally scales the rows of $A$ in an approximate sense.

The next example shows that the scaling may lead to bad results.

**Example 12.11 **Consider the matrix

for . It is easy to show that . Let

Then the scaled matrix

has the condition number , which is a very large value for small .

**2. Preconditioning.** The preconditioning is very close to scaling. We rewrite the linear system $Ax = b$ in the equivalent form

$$(MA)x = Mb,$$

where the matrix $M$ is such that $\mathrm{cond}(MA)$ is smaller than $\mathrm{cond}(A)$ and linear systems with $M$ are easily solvable.

The preconditioning is often used with iterative methods on linear systems with symmetric and positive definite matrices.

**A posteriori error estimates.** The a posteriori estimate of the error of an approximate solution is necessary to get some information on the reliability of the obtained result. There are plenty of such estimates. Here we show three estimates whose computational cost is $\Theta(n^2)$ flops. This cost is acceptable when comparing to the cost of direct or iterative methods ($\Theta(n^3)$, or $\Theta(n^2)$ per iteration step).

**Theorem 12.22 **
**(Auchmuty)**
*Let $\hat x$ be the approximate solution of $Ax = b$. Then*

$$\|x - \hat x\|_2 = c\,\frac{\|r(\hat x)\|_2^2}{\|A^Tr(\hat x)\|_2},$$

*where $c \ge 1$.*

The error constant $c$ depends on $\mathrm{cond}_2(A)$ and the direction of the error vector $\hat x - x$. Furthermore

$$1 \le c \le \mathrm{cond}_2(A).$$

The error constant $c$ takes the upper value only in exceptional cases. The computational experiments indicate that the average value of $c$ grows slowly with the order of $A$ and it depends more strongly on $n$ than the condition number of $A$. The following experimental estimate

seems to hold with a high degree of probability.

The famous LINPACK program package uses the following process to estimate $\|A^{-1}\|$. We solve the linear systems $A^Ty = d$ and $Az = y$. Then the estimate of $\|A^{-1}\|$ is given by

$$\|A^{-1}\| \approx \frac{\|z\|}{\|y\|} \quad \left(\le \|A^{-1}\|\right).$$

Since

$$z = A^{-1}\big(A^{-T}d\big),$$

we can interpret the process as an application of the power method of the eigenvalue problem. The estimate can be used with the $1$- and $\infty$-norms. The entries of the vector $d$ are $\pm 1$, possibly with random signs.

If the linear system $Ax = b$ is solved by the $LU$-method, then the solution of further linear systems costs $\Theta(n^2)$ flops per system. Thus the total cost of the LINPACK estimate remains small. Having the estimate of $\|A^{-1}\|$ we can easily estimate $\mathrm{cond}(A)$ and the error of the approximate solution (cf. Theorem 12.16 or the thumb rule). We remark that several similar processes are known in the literature.

We use the Oettli-Prager theorem in the following form. Let $r = b - A\hat x$ be the residual error, and let $E \in \mathbb{R}^{n\times n}$ and $f \in \mathbb{R}^n$ be given such that $E \ge 0$ and $f \ge 0$. Let

$$\omega = \max_i \frac{|r_i|}{(E|\hat x| + f)_i},$$

where $0/0$ is set to $0$, and $\rho/0$ is set to $\infty$, if $\rho \neq 0$. Symbol $y_i$ denotes the $i$-th component of the vector $y$. If $\omega \neq \infty$, then there exist a matrix $\Delta A$ and a vector $\Delta b$ for which

$$|\Delta A| \le \omega E, \qquad |\Delta b| \le \omega f$$

holds and

$$(A + \Delta A)\hat x = b + \Delta b.$$

Moreover $\omega$ is the smallest number for which $\Delta A$ and $\Delta b$ exist with the above properties. The quantity $\omega$ measures the relative inverse error in terms of $E$ and $f$. If for a given $E$, $f$ and $\hat x$, the quantity $\omega$ is small, then the perturbed problem (and its solution) are close to the original problem (and its solution). In practice, the choice $E = |A|$ and $f = |b|$ is preferred.

Denote by $\hat x$ the approximate solution of $Ax = b$ and let $r(y) = b - Ay$ be the residual error at the point $y$. The precision of the approximate solution $\hat x$ can be improved with the following method.

*Iterative-Refinement($A, b, x_1, tol$)*

 1  $k \leftarrow 1$
 2  $r \leftarrow b - Ax_1$
 3  `WHILE` $\|r\| > tol\,\|x_k\|$
 4    `DO` $k \leftarrow k + 1$
 5      Compute the approximate solution $d$ of $Ad = r$ with the $LU$-method.
 6      $x_k \leftarrow x_{k-1} + d$
 7      $r \leftarrow b - Ax_k$
 8  `RETURN` $x_k$

There are other variants of this process. We can use other linear solvers instead of the $LU$-method.
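A runnable sketch of the refinement loop (all names ours; for brevity it re-factorises $A$ in each step, whereas the algorithm above reuses the $LU$-decomposition):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting on copies of A and b."""
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))   # pivot row
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

def refine(A, b, x, steps=2):
    """Iterative refinement: r = b - A x, solve A d = r, x <- x + d."""
    n = len(b)
    for _ in range(steps):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = solve(A, r)
        x = [x[i] + d[i] for i in range(n)]
    return x
```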

Let $\eta$ be the smallest bound of the relative inverse error with

$$(A + \Delta A)\hat x = b + \Delta b, \qquad |\Delta A| \le \eta|A|, \qquad |\Delta b| \le \eta|b|.$$
Furthermore let

**Theorem 12.23 **
**(Skeel)**
*If , then for sufficiently large we have*

This result often holds after the first iteration, i.e. for $k = 2$. Jankowski and Wozniakowski investigated the iterative refinement for any method which produces an approximate solution with relative error less than $1$. They showed that the iterative refinement improves the precision of the approximate solution even in single precision arithmetic and makes the method weakly stable.

**Exercises**

12.2-1 Prove Theorem 12.7.

12.2-2 Consider the linear systems and , where

and . Which equation is more sensitive to the perturbation of ? What should be the *relative error* of in the more sensitive equation in order to get the solutions of both equations with the same precision?

12.2-3 Let , and

Solve the linear systems for . Explain the results.

12.2-4 Let be a matrix and choose the band matrix consisting of the main and the neighbouring two subdiagonals of as a preconditioning matrix. How much does the *condition number* of improve if (i) is a random matrix; (ii) is a Hilbert matrix?

12.2-5 Let

and assume that is the common error bound of every component of . Give the sharpest possible error bounds for the solution of the equation and for the sum .

12.2-6 Consider the linear system with the approximate solution .

(i) Give an error bound for $\hat x - x$, if $(A + \Delta A)\hat x = b$ holds exactly and both $A$ and $A + \Delta A$ are nonsingular.

(ii) Let

and consider the solution of . Give (if possible) a relative error bound for the entries of such that the integer part of every solution component remains constant within the range of this relative error bound.