This title refers to a fast-developing interdisciplinary area between mathematics, computer science and their applications. The subject is also often called Computational Science and Engineering. Its aim is the efficient use of computer algorithms to solve engineering and scientific problems. One can say, with some simplification, that our subject is related to numerical mathematics, software engineering, computer graphics and applications. Here we can deal only with some basic elements of the subject, such as the fundamentals of *floating point computer arithmetic*, *error analysis*, the basic numerical methods of *linear algebra* and related mathematical software.

Let $x$ be the exact value and let $a$ be an approximation of $x$ ($x\approx a$). The error of the approximation $a$ is defined by the formula $\Delta a = x - a$ (sometimes with the opposite sign). The quantity $\delta a$ is called an **(absolute) error (bound)** of the approximation $a$, if $|x-a| = |\Delta a| \le \delta a$. For example, the error of the approximation $\sqrt{2}\approx 1.41$ is at most $10^{-2}$; in other words, $10^{-2}$ is an error bound of this approximation. The quantities $x$ and $a$ (and accordingly $\Delta a$ and $\delta a$) may be vectors or matrices. In such cases the absolute value and the relation operators must be understood componentwise. We also measure the error by using matrix and vector norms; in such cases, the quantity $\delta a$ is an error bound if the inequality $\|x-a\|\le\delta a$ holds.

The absolute error bound can be irrelevant in many cases. For example, an approximation whose error bound is of the same order as the quantity being estimated has no value. The goodness of an approximation is therefore measured by the **relative error** $\delta a/|x|$ ($\delta a/\|x\|$ for vectors and matrices), which compares the error bound to the approximated quantity. Since the exact value $x$ is generally unknown, we use the approximate relative error $\delta a/|a|$ ($\delta a/\|a\|$). The error thereby committed is proportional to $(\delta a)^2$, which can be neglected if the absolute value (norm) of $x$ and $a$ is much greater than $\delta a$. The relative error is often expressed in percentages.

In practice, the (absolute) error bound is used as a substitute for the generally unknown true error.
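As a quick illustration (a Python sketch; the approximated quantity $\sqrt{2}\approx 1.41$ is our choice of example, not prescribed by the text), the absolute and relative errors can be computed directly:

```python
import math

x = math.sqrt(2)            # value taken as exact (to double precision)
a = 1.41                    # approximation of sqrt(2)

abs_err = abs(x - a)        # absolute error |x - a|
rel_err = abs_err / abs(x)  # relative error
```

Here `abs_err` is below the bound $10^{-2}$, while the relative error is roughly $0.3\%$.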

In the *classical error analysis* we assume input data with given error bounds, exact computations (operations), and we seek the error bound of the final result. Let $x$ and $y$ be exact values with approximations $a$ and $b$, respectively. Assume that the absolute error bounds of the approximations $a$ and $b$ are $\delta a$ and $\delta b$, respectively. Using the classical error analysis approach we obtain the following error bounds for the four basic arithmetic operations:
$$\delta(a+b)=\delta a+\delta b,\qquad \delta(a-b)=\delta a+\delta b,$$
$$\delta(ab)\approx |a|\,\delta b+|b|\,\delta a,\qquad \delta\!\left(\frac{a}{b}\right)\approx\frac{|a|\,\delta b+|b|\,\delta a}{b^{2}}\quad(b\neq 0).$$
We can see that division by a number near $0$ can make the absolute error arbitrarily big. Similarly, if the result of a subtraction is near $0$, that is, if the operands are nearly equal, then its relative error can become arbitrarily big. One has to avoid these cases. The subtraction operation in particular can be quite dangerous.

**Example 12.1 **Calculate the quantity $\sqrt{1996}-\sqrt{1995}$ with the approximations $\sqrt{1996}\approx 44.68$ and $\sqrt{1995}\approx 44.67$, whose common absolute and relative error bounds are $5\times 10^{-3}$ and $1.2\times 10^{-4}$, respectively. One obtains the approximate value $44.68-44.67=0.01$, whose relative error bound is
$$\frac{5\times 10^{-3}+5\times 10^{-3}}{0.01}=1,$$

that is, $100\%$. The true relative error is about $10.7\%$. This is still far too big, since it is approximately $1000$ times bigger than the relative error of the initial data. We can avoid the subtraction operation by using the identity
$$\sqrt{1996}-\sqrt{1995}=\frac{1}{\sqrt{1996}+\sqrt{1995}}\approx\frac{1}{44.68+44.67}=\frac{1}{89.35}.$$

Here the numerator is exact, while the absolute error of the denominator is $10^{-2}$. Hence the relative error (bound) of the quotient is about $10^{-2}/89.35\approx 1.1\times 10^{-4}$. The latter result agrees with the relative error of the initial data and is substantially smaller than the one obtained with the direct subtraction operation.
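The effect is easy to reproduce in double precision (a Python sketch, assuming the two-decimal input approximations $44.68\approx\sqrt{1996}$ and $44.67\approx\sqrt{1995}$):

```python
import math

exact = math.sqrt(1996) - math.sqrt(1995)  # reference value

a, b = 44.68, 44.67      # two-decimal approximations of the square roots
direct = a - b           # catastrophic cancellation: one significant digit left
trick = 1.0 / (a + b)    # equivalent form that avoids the subtraction

rel_direct = abs(direct - exact) / exact
rel_trick = abs(trick - exact) / exact
```

The direct subtraction loses roughly three orders of magnitude of relative accuracy compared with the reformulated quotient.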

The first-order error terms of twice differentiable functions can be obtained from their first-order *Taylor polynomial*: if $f$ is twice differentiable at $a$, then
$$\Delta f(a)\approx f'(a)\,\Delta a,\qquad \delta f(a)\approx |f'(a)|\,\delta a.$$

The numerical sensitivity of a function $f$ at a given point $a$ is characterised by the **condition number** $c(f,a)$, which is the ratio of the relative error of the approximate function value to the relative error of the input data (for vector variables the derivative $f'(a)$ is the *Jacobian matrix* of $f$ at the point $a$):
$$c(f,a)=\frac{|a|\,|f'(a)|}{|f(a)|},\qquad c(f,a)=\frac{\|a\|\,\|f'(a)\|}{\|f(a)\|}.$$

We can consider the condition number as the magnification factor of the input relative error. Therefore the function $f$ is considered **numerically stable** (or **well-conditioned**) at the point $a$ if $c(f,a)$ is “small”. Otherwise $f$ is considered **numerically unstable** (**ill-conditioned**) at $a$. The condition number depends on the point $a$: a function can be well-conditioned at one point, while it is ill-conditioned at another. The term “small” is relative. It depends on the problem, the computer and the required precision.
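A minimal Python sketch of the condition number at work; the sample function $\ln x$ and the evaluation points are our choices for illustration:

```python
import math

def cond(f, df, x):
    """Relative condition number c(f, x) = |x * f'(x) / f(x)|."""
    return abs(x * df(x) / f(x))

# c(ln, x) = 1/|ln x|: ill-conditioned near x = 1, well-conditioned far away
c_near = cond(math.log, lambda x: 1.0 / x, 1.001)
c_far = cond(math.log, lambda x: 1.0 / x, 100.0)
```

Near $x=1$ the condition number is about $1000$, so a relative input error of $10^{-6}$ can grow to about $10^{-3}$ in the function value; at $x=100$ it is below $1$.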

The condition number of a matrix can be defined as an upper bound of a function condition number. Let us define the mapping $F:\mathbb{R}^{n}\to\mathbb{R}^{n}$ by the solution of the equation $Ay=x$ ($A\in\mathbb{R}^{n\times n}$, $\det(A)\neq 0$), that is, let $F(x)=A^{-1}x$. Then $F'=A^{-1}$ and
$$c(F,x)=\frac{\|x\|\,\|A^{-1}\|}{\|A^{-1}x\|}=\frac{\|Ay\|\,\|A^{-1}\|}{\|y\|}\le\|A\|\,\|A^{-1}\|\qquad(x=Ay).$$

The upper bound on the right-hand side is called the **condition number of the matrix** $A$: $\mathrm{cond}(A)=\|A\|\,\|A^{-1}\|$. This bound is sharp, since there exists a vector $x$ such that $c(F,x)=\|A\|\,\|A^{-1}\|$.
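For small matrices the bound $\|A\|\,\|A^{-1}\|$ can be computed directly. The following Python sketch uses the row-sum (infinity) norm; the $2\times 2$ sample matrices are our choices:

```python
def norm_inf(M):
    """Row-sum (infinity) norm of a 2x2 matrix."""
    return max(abs(M[0][0]) + abs(M[0][1]), abs(M[1][0]) + abs(M[1][1]))

def inv2(M):
    """Inverse of a 2x2 matrix by the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def cond_inf(M):
    """Condition number cond(M) = ||M|| * ||M^-1|| in the infinity norm."""
    return norm_inf(M) * norm_inf(inv2(M))

well = cond_inf([[1.0, 0.0], [0.0, 1.0]])    # identity: cond = 1
ill = cond_inf([[1.0, 1.0], [1.0, 1.0001]])  # nearly singular matrix
```

The identity matrix has condition number $1$, while the nearly singular matrix has a condition number of order $10^4$: solving a linear system with it can magnify input relative errors by that factor.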

Let us investigate the calculation of the function value $f(x)$. If we calculate the approximation $\hat{y}$ instead of the exact value $y=f(x)$, then the forward error is $\Delta y=\hat{y}-y$. If for a value $\Delta x$ the equality $\hat{y}=f(x+\Delta x)$ holds, that is, $\hat{y}$ is the exact function value of the perturbed input data $x+\Delta x$, then $\Delta x$ is called the **backward error**. The connection of the two concepts is shown in Figure 12.1.

The continuous line shows the exact value, while the dashed one indicates the computed value. The analysis of the backward error is called **backward error analysis**. If more than one backward error exists, then the estimation of the smallest one is the most important.

An algorithm for computing the value $y=f(x)$ is called **backward stable**, if for any $x$ it gives a computed value $\hat{y}$ with a small backward error $\Delta x$. Again, the term “small” is relative to the problem environment.

The connection of the forward and backward errors is described by the approximate rule of thumb
$$\frac{\|\hat{y}-y\|}{\|y\|}\lesssim c(f,x)\,\frac{\|\Delta x\|}{\|x\|},$$

which means that

relative forward error $\lesssim$ condition number $\times$ relative backward error.

This inequality indicates that the computed solution of an ill-conditioned problem may have a big relative forward error. An algorithm is said to be **forward stable** if the forward error is small. A forward stable method is not necessarily backward stable. However, if the backward error and the condition number are small, then the algorithm is forward stable.

**Example 12.2 **Consider the function $f(x)=\ln x$, the condition number of which is $c(f,x)=|1/\ln x|$. For $x\approx 1$ the condition number $c(f,x)$ is big. Therefore the relative forward error is big for $x\approx 1$.

The classical error analysis investigates only the effects of the input data errors and assumes exact arithmetic operations. Digital computers, however, represent the numbers with a finite number of digits: the arithmetic computations are carried out on the elements of a finite set $F$ of such numbers, and the results of the operations must also belong to $F$. Hence the computer representation of the numbers may add further errors to the input data, and the results of arithmetic operations may also be subject to further rounding. If the result of an operation belongs to $F$, then we have the exact result. Otherwise we have three cases:

(i) rounding to representable (nonzero) number;

(ii) underflow (rounding to $0$);

(iii) overflow (in case of results whose moduli are too large).

Most of the scientific-engineering calculations are done in *floating point arithmetic*, whose generally accepted model is the following:

**Definition 12.1 **
*The set of floating point numbers is given by*
$$F(\beta,t,L,U)=\left\{\pm m\cdot\beta^{e}\;:\;\frac{1}{\beta}\le m<1,\; m=0.d_1d_2\ldots d_t,\; L\le e\le U\right\}\cup\{0\},$$

*where*

*– $\beta\ge 2$ is the base (or radix) of the number system,*

*– $m$ is the mantissa in the number system with base $\beta$,*

*– $e$ is the exponent,*

*– $t$ is the length of the mantissa (the precision of the arithmetic),*

*– $L$ is the smallest exponent (underflow exponent),*

*– $U$ is the biggest exponent (overflow exponent).*

The parameters of the three most often used number systems are indicated in the following table

The mantissa can be written in the form
$$m=0.d_1d_2\ldots d_t=\sum_{i=1}^{t}d_i\beta^{-i}. \tag{12.2}$$

We can observe that the condition $1/\beta\le m<1$ implies the inequality $1\le d_1\le\beta-1$ for the first digit $d_1$. The remaining digits must satisfy $0\le d_i\le\beta-1$ ($i=2,\ldots,t$). Such arithmetic systems are called **normalized**. The leading zero digit and the radix point are not represented. If $\beta=2$, then the first digit is $d_1=1$, which is also unrepresented. Using the representation (12.2) we can give the set $F$ in the form
$$F(\beta,t,L,U)=\left\{\pm\sum_{i=1}^{t}d_i\beta^{e-i}\;:\;L\le e\le U\right\}\cup\{0\},$$

where $1\le d_1\le\beta-1$ and $0\le d_i\le\beta-1$ for $i=2,\ldots,t$.

**Example 12.3 **The set $F(2,3,-1,2)$ contains $33$ elements and its positive elements are given by
$$\left\{\frac{1}{4},\frac{5}{16},\frac{3}{8},\frac{7}{16},\frac{1}{2},\frac{5}{8},\frac{3}{4},\frac{7}{8},1,\frac{5}{4},\frac{3}{2},\frac{7}{4},2,\frac{5}{2},3,\frac{7}{2}\right\}.$$

The elements of $F$ are not equally distributed on the real line. The distance of two consecutive numbers of $F(2,3,-1,2)$ in $[1,2]$ is $1/4$. Since the elements of $F$ are of the form $\pm m\cdot 2^{e}$, the distance of two consecutive numbers in $F$ is changing with the exponent. The maximum distance of two consecutive floating point numbers is $1/2$, while the minimum distance is $1/16$.

For the mantissa we have $1/\beta\le m\le 1-\beta^{-t}$, since
$$m=0.d_1d_2\ldots d_t\le\sum_{i=1}^{t}(\beta-1)\beta^{-i}=1-\beta^{-t}.$$
Using this observation we can easily prove the following result on the range of floating point numbers.

**Theorem 12.2 **
*If $a\in F(\beta,t,L,U)$, $a\neq 0$, then $m_\infty\le|a|\le M_\infty$, where*
$$m_\infty=\beta^{L-1},\qquad M_\infty=\beta^{U}\left(1-\beta^{-t}\right).$$
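Both the structure of such a set and the bounds of the theorem can be checked by brute-force enumeration. The following Python sketch uses exact rational arithmetic and the small set $F(2,3,-1,2)$ of Example 12.3:

```python
from fractions import Fraction

def float_set(beta, t, L, U):
    """All elements of F(beta, t, L, U), computed with exact rationals."""
    F = {Fraction(0)}
    for e in range(L, U + 1):
        # normalized mantissas 0.d1...dt with d1 >= 1: the digit string,
        # read as an integer M, runs over [beta^(t-1), beta^t)
        for M in range(beta ** (t - 1), beta ** t):
            m = Fraction(M, beta ** t)
            F.add(m * Fraction(beta) ** e)
            F.add(-m * Fraction(beta) ** e)
    return F

F = float_set(2, 3, -1, 2)
pos = sorted(x for x in F if x > 0)
gaps = [b - a for a, b in zip(pos, pos[1:])]
```

The enumeration confirms the $33$ elements, the bounds $m_\infty=2^{-2}=1/4$ and $M_\infty=2^{2}(1-2^{-3})=7/2$, and the exponent-dependent spacing between $1/16$ and $1/2$.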

Let $a,b\in F$ and let $a\,\square\,b$ denote any of the four arithmetic operations $+,-,\times,/$. The following cases are possible:

(1) $a\,\square\,b\in F$ (exact result),

(2) $|a\,\square\,b|>M_\infty$ (arithmetic overflow),

(3) $0<|a\,\square\,b|<m_\infty$ (arithmetic underflow),

(4) $a\,\square\,b\notin F$, $m_\infty<|a\,\square\,b|<M_\infty$ (not representable result).

In the last two cases the *floating point arithmetic* rounds the result to the nearest floating point number in $F$. If two consecutive floating point numbers are equally distant from the result, then we generally round to the one of greater absolute value. For example, in a five-digit decimal arithmetic, the number $0.123455$ is rounded to $0.12346$.

Let $G=\{x\in\mathbb{R}:|x|\le M_\infty\}$. It is clear that $F\subset G$. Let $x\in G$. Then $fl(x)$ denotes an element of $F$ nearest to $x$. The mapping $x\mapsto fl(x)$ is called rounding. The quantity $|fl(x)-x|$ is called the rounding error. If $x\in[\beta^{e-1},\beta^{e}]$, then the rounding error is at most $\beta^{e-t}/2$. The quantity
$$u=\frac{1}{2}\beta^{1-t}$$
is called the **unit roundoff**. The quantity $u$ is the relative error bound of $fl(x)$, that is, $fl(x)=x(1+\varepsilon)$ with $|\varepsilon|\le u$.

**Proof. **Without loss of generality we can assume that $x>0$. Let $u_1,u_2\in F$ be two consecutive numbers such that
$$u_1\le x\le u_2.$$

Either $fl(x)=u_1$ or $fl(x)=u_2$ holds. Since $\beta^{e-1}\le x<\beta^{e}$ holds for some exponent $e$ in both cases, we have
$$u_2-u_1\le\beta^{e-t}.$$

It follows that
$$|fl(x)-x|\le\frac{u_2-u_1}{2}\le\frac{\beta^{e-t}}{2}.$$

Hence $fl(x)=x+\eta$, where $|\eta|\le\beta^{e-t}/2$. A simple arrangement yields
$$\frac{|fl(x)-x|}{|x|}=\frac{|\eta|}{x}\le\frac{\beta^{e-t}/2}{\beta^{e-1}}=\frac{1}{2}\beta^{1-t}.$$

Since $fl(x)=x(1+\varepsilon)$ with $|\varepsilon|\le\frac{1}{2}\beta^{1-t}=u$, we proved the claim.

Thus we have proved that the relative error of rounding in floating point arithmetic is bounded, and the bound is the unit roundoff $u$.
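This bound is easy to test experimentally. The Python sketch below assumes IEEE double precision ($\beta=2$, $t=53$); converting an exact rational to `float` performs the rounding $fl$, and the sample rationals are our choices of non-representable numbers:

```python
from fractions import Fraction

t = 53                     # mantissa length of IEEE double precision
u = Fraction(1, 2 ** t)    # unit roundoff u = (1/2) * 2**(1 - t) = 2**-53

def rounding_rel_err(x):
    """Relative error of rounding the exact rational x to the nearest double."""
    return abs(Fraction(float(x)) - x) / abs(x)

# none of these rationals is exactly representable in binary
samples = [Fraction(1, 3), Fraction(1, 10), Fraction(32, 16807)]
errs = [rounding_rel_err(x) for x in samples]
```

Every relative rounding error is positive (the numbers are not representable) yet never exceeds $u$.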

Another quantity used to measure the rounding errors is the so-called **machine epsilon**
$\epsilon_M=\beta^{1-t}$ ($\epsilon_M=2u$). The number $\epsilon_M$ is the distance between $1$ and its nearest neighbour in $F$ greater than $1$. The following algorithm determines $\epsilon_M$ in the case of binary base.

*
Machine-Epsilon
*

1  x ← 1

2  `WHILE` 1 + x/2 > 1

3      `DO` x ← x/2

4  `RETURN` x

In the MATLAB system $\epsilon_M\approx 2.2204\times 10^{-16}$.
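The algorithm translates directly into Python; for IEEE double precision the returned value is $2^{-52}$:

```python
import sys

def machine_epsilon():
    """Halve x until 1 + x/2 is no longer distinguishable from 1."""
    x = 1.0
    while 1.0 + x / 2.0 > 1.0:
        x /= 2.0
    return x

eps = machine_epsilon()   # 2**-52 for IEEE doubles
```

The result can be compared against the value the runtime reports (`sys.float_info.epsilon`).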

For the results of floating point arithmetic operations we assume the following (standard model):
$$fl(a\,\square\,b)=(a\,\square\,b)(1+\varepsilon),\qquad |\varepsilon|\le u\qquad (a,b\in F). \tag{12.4}$$

The IEEE arithmetic standard satisfies this assumption. It is an important consequence of the assumption that for $a\,\square\,b\neq 0$ the relative error of the arithmetic operations satisfies
$$\frac{|fl(a\,\square\,b)-(a\,\square\,b)|}{|a\,\square\,b|}\le u.$$

Hence the relative error of the floating point arithmetic operations is small.
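The standard model can be checked against exact rational arithmetic. In this Python sketch the sample operands are our choices, and $u=2^{-53}$ is the unit roundoff of IEEE double precision:

```python
import operator
from fractions import Fraction

u = Fraction(1, 2 ** 53)   # unit roundoff of IEEE double precision

def op_rel_err(a, b, op):
    """Relative error of a floating point operation against the exact result."""
    computed = Fraction(op(a, b))          # fl(a op b), converted exactly
    exact = op(Fraction(a), Fraction(b))   # a op b in exact rational arithmetic
    return abs(computed - exact) / abs(exact)

ops = [operator.add, operator.sub, operator.mul, operator.truediv]
worst = max(op_rel_err(a, b, op)
            for op in ops
            for a, b in [(0.1, 0.2), (1.0, 3.0), (7.7, -0.003)])
```

For every operation and operand pair the relative error stays within the unit roundoff, as (12.4) promises.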

There exist computer floating point arithmetics that do not comply with the standard model (12.4). The usual reason for this is that the arithmetic lacks a guard digit in subtraction. For simplicity we investigate the subtraction $0.100\cdot 2^{1}-0.111\cdot 2^{0}$ (that is, $1-7/8$) in a three digit binary arithmetic. In the first step we equate the exponents:
$$0.100\cdot 2^{1}-0.0111\cdot 2^{1}.$$

If the computation is done with four digits, the result is the following:
$$0.100\cdot 2^{1}-0.0111\cdot 2^{1}=0.0001\cdot 2^{1},$$

from which the normalized (and exact) result is $0.100\cdot 2^{-2}=1/8$. Observe that the subtracted number is unnormalized. The temporary fourth digit of the mantissa is called a guard digit. Without a guard digit the computations are the following:
$$0.100\cdot 2^{1}-0.011\cdot 2^{1}=0.001\cdot 2^{1}.$$

Hence the normalized result is $0.100\cdot 2^{-1}=1/4$ with a relative error of $100\%$. Several CRAY computers and pocket calculators lack guard digits.

Without the guard digit the floating point arithmetic operations satisfy only the weaker conditions
$$fl(x\pm y)=x(1+\alpha)\pm y(1+\beta),\qquad |\alpha|,|\beta|\le u.$$

Assume that we have a guard digit and the arithmetic complies with the standard model (12.4). Introduce the following notations: for a matrix $A=[a_{ij}]$ let $|A|=[|a_{ij}|]$, and let matrix inequalities be understood componentwise.

The following results hold for $x,y\in\mathbb{R}^{n}$, $A,B\in\mathbb{R}^{n\times n}$ and $\alpha\in\mathbb{R}$:
$$fl(x^{T}y)=(x+\Delta x)^{T}y,\qquad |\Delta x|\le nu|x|+O(u^{2}),$$
$$fl(\alpha A)=\alpha A+E,\qquad |E|\le u|\alpha A|,$$
$$fl(A+B)=(A+B)+E,\qquad |E|\le u|A+B|,$$
$$fl(AB)=AB+E,\qquad |E|\le nu|A|\,|B|+O(u^{2}),$$

where $E$ denotes the error (matrix) of the actual operation.

The standard *floating point arithmetics* have many special properties. An important property is that, because of the rounding, addition is not associative.

**Example 12.4 **If $a=10^{30}$, $b=-10^{30}$ and $c=1$, then using MATLAB and an AT386 type PC we obtain
$$(a+b)+c=1,\qquad a+(b+c)=0.$$

We can have a similar result on a Pentium1 machine with an appropriate choice of the data.
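The same loss of associativity is easy to reproduce in any IEEE double precision environment, for example in Python:

```python
a, b, c = 1e30, -1e30, 1.0

left = (a + b) + c   # a + b cancels exactly to 0, so c survives
right = a + (b + c)  # c is absorbed when added to the huge b, then b cancels a
```

The two groupings of the mathematically identical sum give $1$ and $0$, respectively.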

The example also indicates that different (numerical) processors may produce different computational results for the same calculation. Because of the rounding, the result of addition can also depend on the order in which the terms are summed. Consider the computation of the sum $\sum_{i=1}^{n}x_i$. The usual algorithm is the recursive summation.

*
Recursive-Summation($n,x$)
*

1  s ← 0

2  `FOR` i ← 1 `TO` n

3      `DO` s ← s + x_i

4  `RETURN` s

Consider terms $x_i$ that decrease rapidly in magnitude. The recursive summation algorithm (and MATLAB) gives one result when the terms are added in decreasing order. If the summation is done in the reverse (increasing) order, then the result is different.

If the two values are compared with the exact result, then we can see that the second summation gives the better result. In this case the sum of the smaller numbers accumulates enough to give significant digits to the final result, unlike in the first case.
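The order dependence can be demonstrated with a contrived data set of our choosing (one large term and $10^{6}$ tiny ones), using `math.fsum` as a correctly rounded reference:

```python
import math

terms = [1.0] + [1e-16] * 10 ** 6   # one large term followed by many tiny ones

decreasing = 0.0
for t in terms:                     # large term first: every 1e-16 is rounded away
    decreasing += t

increasing = 0.0
for t in reversed(terms):           # tiny terms accumulate before the large one
    increasing += t

exact = math.fsum(terms)            # correctly rounded reference sum
```

Summing in decreasing order loses the entire $10^{-10}$ contribution of the small terms, while the increasing order recovers it almost exactly.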

The last example indicates that the summation of a large number of terms varying in modulus and sign is a complicated task. The following algorithm of W. Kahan is one of the most interesting procedures to solve this problem.

*
Compensated-Summation($n,x$)
*

1  s ← 0

2  e ← 0

3  `FOR` i ← 1 `TO` n

4      `DO` t ← s

5          y ← x_i + e

6          s ← t + y

7          e ← (t − s) + y

8  `RETURN` s
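A direct Python transcription of the compensated summation, applied to an ill-behaved data set of one large term and $10^{6}$ tiny terms (our choice of test data):

```python
def compensated_sum(xs):
    """Kahan's algorithm: e carries the rounding error of each addition
    and feeds it back into the next term."""
    s, e = 0.0, 0.0
    for x in xs:
        t = s
        y = x + e
        s = t + y
        e = (t - s) + y
    return s

data = [1.0] + [1e-16] * 10 ** 6
naive = sum(data)              # the tiny terms are absorbed one by one
comp = compensated_sum(data)   # compensation preserves their total
```

The naive left-to-right sum returns $1.0$, while the compensated sum recovers $1+10^{-10}$ to within a few units of roundoff.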

The ANSI/IEEE Standard 754-1985 for binary ($\beta=2$) *floating point arithmetic systems* was published in 1985. The standard specifies the basic arithmetic operations, comparisons, rounding modes, the arithmetic exceptions and their handling, and conversion between the different arithmetic formats. The square root is included as a basic operation. The standard does not deal with the exponential and transcendental functions. The standard defines two main floating point formats:

– single precision: 32 bits, with $t=24$, $L=-125$, $U=128$;

– double precision: 64 bits, with $t=53$, $L=-1021$, $U=1024$.

In both formats one bit is reserved as a sign bit. Since the floating point numbers are normalized and the first mantissa digit is always $1$, this bit is not stored: it is the so-called hidden bit.

The arithmetic standard contains the handling of arithmetic exceptions.

(The nonzero numbers of the form $\pm m\cdot 2^{L}$ with mantissa $m<1/2$, that is, with leading mantissa digit $0$ and the smallest exponent, are called **subnormal numbers.**) The IEEE arithmetic is a closed system: every arithmetic operation has a result, whether it is expected mathematically or not. The exceptional operations raise a signal and continue. The arithmetic standard conforms to the standard model (12.4).

The first hardware implementation of the IEEE standard was the Intel 8087 mathematical coprocessor. Since then the standard has been generally accepted and used.

*Remark.* In single precision we have about 7 significant decimal digits of precision. In double precision we have approximately 16 significant decimal digits. There also exists an extended precision format of 80 bits, where $t=64$ and the exponent has 15 bits.

**Exercises**

12.1-1 The measured values of two resistors are $R_1\pm\delta R_1$ and $R_2\pm\delta R_2$. We connect the two resistors in parallel and obtain the circuit resistance $R=R_1R_2/(R_1+R_2)$. Calculate the relative error bounds of the initial data and the approximate value of the resistance $R$. Evaluate the absolute and relative error bounds $\delta R$ and $\delta R/|R|$, respectively, in the following three ways:

(i) Estimate $\delta R$ first using only the absolute error bounds of the input data, then estimate the relative error bound $\delta R/|R|$.

(ii) Estimate first the relative error bound $\delta R/|R|$ using only the relative error bounds of the input data, then estimate the absolute error bound $\delta R$.

(iii) Consider the circuit resistance as a two-variable function $R=F(R_1,R_2)=R_1R_2/(R_1+R_2)$.

12.1-2 Assume that is calculated with the absolute error bound . The following two expressions are theoretically equal:

(i)

(ii) .

Which expression can be calculated with less relative error and why?

12.1-3 Consider the arithmetic operations as two-variable functions of the form $f(x,y)=x\,\square\,y$, where $\square\in\{+,-,\times,/\}$.

(i) Derive the *error bounds* of the arithmetic operations from the error formula of two variable functions.

(ii) Derive the *condition numbers* of these functions. When are they ill-conditioned?

(iii) Derive *error bounds* for the power function $x^{y}$ assuming that both the base and the exponent have errors. What is the result if the exponent is exact?

(iv) Let , and . Determine the smallest and the greatest value of as a function of such that the *relative error bound* of should be at most .

12.1-4 Assume that a number is calculated with a 24 bit long mantissa and that the exponential function is also calculated with 24 significant bits. Estimate the *absolute error* of the result. Estimate the *relative error* without using the actual value of the number.

12.1-5 Consider the *floating point number set* $F(\beta,t,L,U)$ and show that

(i) Every arithmetic operation can result in *arithmetic overflow*;

(ii) Every arithmetic operation can result in *arithmetic underflow*.

12.1-6 Show that the following expressions are *numerically unstable* for :

(i)

(ii) ;

(iii) .

Calculate the values of the above expressions for and estimate the error. Manipulate the expressions into *numerically stable* ones and estimate the error as well.

12.1-7 How many elements does the set have? How many *subnormal numbers* can we find?

If , then and equality holds if and only if . Is it true numerically? Check the inequality experimentally for various data (small and large numbers, numbers close to each other or differing in magnitude).