Download pdf - Matthias Petschow and Paolo Bientinesi - umu.sehpac.cs.umu.se/teaching/lsc-13/lect2-fpa.pdf · 2021. 2. 6. · Introduction to oating point arithmetic Matthias Petschow and Paolo

Introduction to floating point arithmetic

Matthias Petschow and Paolo Bientinesi

AICES, RWTH [email protected]

October 24th, 2013Aachen, Germany

Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 1 / 21

“Disclaimer”

Muller et al. - Handbook ofFloating-Point Arithmetic -595 pages!

Many topics not covered in thislecture (e.g., hardware/softwareimplementation of FPA, languagesupport, cleverly using FPA)

This lecture: basics of floatingpoint representation and arithmetic

Goal: Make you aware of the issuesarising in finite precisioncomputations

See the references below for a morethorough treatment of the topic


References

Article: “What every computer scientist should know

about floating-point arithmetic”, by David Goldberg

Article: “Lecture Notes on the Status of IEEE

Standard 754 for Binary Floating-Point

Arithmetic”, by William Kahan

Book: “Accuracy and Stability of Numerical

Algorithms”, by Nick Higham

Book: “Numerical Computing with IEEE Floating

Point Arithmetic”, by Michael Overton

IEEE 754-1985 and IEEE 754-2008:“Standard for Floating-Point Arithmetic”


Numerical Representation

Numbers123

29= (first 40 digits)

4.241379310344827586206896551724137931034 . . .

π =3.141592653589793238462643383279502884197 . . .

In general: Infinite number of digits

Computers

Finite memory ⇒ Approximated numbers


Numerical Representation

Numbers123

29= (first 40 digits)

4.241379310344827586206896551724137931034 . . .

π =3.141592653589793238462643383279502884197 . . .

In general: Infinite number of digits

Computers

Finite memory ⇒ Approximated numbers


FP representation

Floating point system F\{0}

(−1)s × d0.d1d2d3 . . . dp−1 × βewith base β, precision p, and

s ∈ {0, 1}di ∈ {0, . . . , β − 1}d0 6= 0

emin ≤ e ≤ emax

IEEE-754 specifications (β = 2, d0 = 1)

Single: p = 24, emin = −126, emax = 127Double: p = 53, emin = −1022, emax = 1023Additionally: ±0, ±∞, subnormal numbers, NaNs


IEEE single precision


IEEE single precision

(−1)s × (1 + f)× 2E−127

f = d1 · 2−1 + d2 · 2−2 + d3 · 2−3 + . . .+ d23 · 2−23

e = E − 127, biased exponent 1 ≤ E ≤ 254

Special values for E = 0:F = 0 : ±0F 6= 0 : subnormal numbers with d0 = 0 and implicit

exponent e = −126

Special values for E = 255:F = 0 : ±∞F 6= 0 : NaNs


IEEE double precision


Example questions

Question 1What is the largest finite floating point number is IEEE singleprecision?

Question 2What is the smallest positive normalized floating point number isIEEE single precision? What is the smallest positive number?

Question 3How many normalized IEEE single precision numbers and how manysubnormal numbers are there?


Example questions

Question 4

Given IEEE single/double precision (p = 24, 53), what is the gapbetween 1 and the next larger number?

Question 5Between an adjacent pair of nonzero IEEE single precision realnumbers, how many IEEE double precision numbers are there?

Question 6What is the largest integer u such that all integer in the interval[−u, u] are exactly representable in IEEE single precision format?What is the corresponding u for double precision?


Representation error

Relative rounding error

Let x ∈ R and |x| ∈ [ωmin, ωmax], then

x̄ = x(1 + δ) where |δ| ≤ u

andx̄ = x/(1 + δ̃) where |δ̃| ≤ u

with u denoting the unit roundoff.

ωmin = smallest normalized positive floating point number

ωmax = largest normalized positive floating point number

x̄ = [x] = floating point representation of x

If rounded to nearest float, x̄ = RN(x) and u = 2−p

For other rounding modes, u = 2−p+1


Example questions

TaskGiven a base 2 floating point format with precision p, show that ifx ∈ R lies in the normalized range thanRN(x) = x(1 + δ), with |δ| ≤ 2−p,where RN() rounds to the nearest floating point number.

Question

What if RN(x) is subnormal? Find an example where the above isnot true.

Question

What if we truncate – that is round to zero RZ() – instead ofrounding to the nearest number RN()?


Finite precision arithmetic4-digit representation

Inexact Arithmetic

123.4 +.5678 =

123.9678 =

Truncated 123.9Rounded 124.0

Associativity?

No!

Exact arithmetic:(123.4 + .5678) + .5432 =123.4 + (.5678 + .5432)

Inexact arithmetic:(123.4 + .5678) + .5432 = 124.4123.4 + (.5678 + .5432) = 124.5



Inexact Arithmetic

123.4 +.5678 =

123.9678 =

Truncated 123.9Rounded 124.0

Associativity?

No!





Inexact Arithmetic

123.4 +.5678 =

123.9678 =Truncated 123.9

Rounded 124.0

Associativity?

No!





Inexact Arithmetic

123.4 +.5678 =

123.9678 =Truncated 123.9

Rounded 124.0

Associativity?

No!





Inexact Arithmetic

123.4 +.5678 =

123.9678 =Truncated 123.9

Rounded 124.0

Associativity?

No!


Inexact arithmetic:(123.4 + .5678) + .5432 = 124.4

123.4 + (.5678 + .5432) = 124.5



Inexact Arithmetic

123.4 +.5678 =

123.9678 =Truncated 123.9

Rounded 124.0

Associativity? No!




IEEE FP arithmetic

The standard floating point model

Given operation � ∈ {+,−,×, /} and x, y ∈ F0,

[x � y] = RN(x � y) = (x � y)(1 + δ)

with |δ| ≤ u, provided no underflow/overflow occurred.

RN() can be replaced by other rounding modes.

For RN() we have seen that u = 2−p.

Also, (x � y)(1 + δ) can be replaced by (x � y)/(1 + δ).

Similarly,√x = RN(x).

What about sin(x), cos(x), exp(x), log(x), ...?

Notation: Assuming a left-to-right evaluation, it holds

[x+ y + z/w] =[[

[x] + [y]]

+[[z]/[w]

]]Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 14 / 21

Example questions

Question 1

If x is a floating point number, is the floating point product [1× x]equal to x?

Question 2

If x 6= 0 is a (finite) floating point number, is the floating pointquotient [x/x] equal to 1?

Question 3If x is a floating point number, is the floating point product[0.5× x] equal to floating point quotient [x/2]?


Example questions

Question 4

Is it true that for all a, b ∈ R we have [a+ b] = [b+ a] and[a× b] = [b× a]?

Question 5

Let a = −1, b = 1, and c = 2−25. In IEEE single precisionarithmetic, what are the results of [a+ [b+ c]] and [[a+ b] + c]?

Question 6

Let a = b = 2513, and c = 2−1022. In IEEE double precisionarithmetic, what are the results of [a× [b× c]] and [[a× b]× c]?


Example question

Question 7

What is the result calling the function fun(2.0,0.0) defined below(using IEEE floating point values)?

fun(a, b) {res = 1/(1/a+ 1/b)return(res) }

Question 8What is a potential problem of the following program? How can itbe fixed?

hypot(a, b) {c = a2 + b2

c = sqrt(c)return(c) }


Example questions

Task

Calculating the roots of a quadratic polynomial with roots() asbelow can cause inaccurate results (e.g. a = c = 1, b = −108 fordouble precision). Write a code that avoids cancellation as much aspossible, e.g. by using r1r2 = c/a. Are there other potentialproblems, such as overflow etc.?

roots(a, b, c) {r1 = −b+

√b2−4ac2a

r2 = −b−√b2−4ac2a

return(r1, r2) }


Error Analysis

f : A→ B, y = f(x)

Es.: f(x) = x2 + sin(2 ∗ x)

x =π

123, f(x) =?

Exact arithmetic:( π

123

)2+ sin

(2 ∗ π

123

)= ...

Inexact arithmetic:

x x̂, f f̂ f̂(x̂) instead of f(x)


Error Analysis

f : A→ B, y = f(x)

Es.: f(x) = x2 + sin(2 ∗ x)

x =π

123, f(x) =?


123

)2+ sin

(2 ∗ π

123

)= ...

Inexact arithmetic:



Error Analysis

f : A→ B, y = f(x)

Es.: f(x) = x2 + sin(2 ∗ x)

x =π

123, f(x) =?


123

)2+ sin

(2 ∗ π

123

)= ...

Inexact arithmetic:



Example: Dot Product

x, y ∈ Rn; κ := xT y

κ :=((

(χ0ψ0 + χ1ψ1) + · · ·)

+ χn−2ψn−2

)+ χn−1ψn−1

κ̌ =

(((χ0ψ0(1 + ε

(0)∗ ) + χ1ψ1(1 + ε

(1)∗ ))(1 + ε

(1)+ ) + · · ·

)(1 + ε

(n−2)+ )

+χn−1ψn−1(1 + ε(n−1)∗ )

)(1 + ε

(n−1)+ )

=

n−1∑i=0

χiψi(1 + ε(i)∗ )

n−1∏j=i

(1 + ε(j)+ )

where ε

(0)+ = 0 and |ε(0)∗ |, |ε(j)∗ |, |ε(j)+ | ≤ u for j = 1, . . . , n− 1


Backward Stability

Let f : D → R be a map from the domain D to the range R.Let f̂ : D → R represent the execution in floating point arithmetic of a givenalgorithm A that computes f .

A is said to be backward stable iffor all x ∈ D there exists a perturbed input x̄ ∈ D, close to x, such thatf̂(x) = f(x̄).

I.e., the result computed in floating point arithmetic (f̂(x)) equals the resultobtained when the mathematically exact function (f) is applied to slightlyperturbed data (x̄).

The difference between x̄ and x, is the perturbation to the original input x.


Backward Stability

Let f : D → R be a map from the domain D to the range R.Let f̂ : D → R represent the execution in floating point arithmetic of a givenalgorithm A that computes f .

A is said to be backward stable iffor all x ∈ D there exists a perturbed input x̄ ∈ D, close to x, such thatf̂(x) = f(x̄).

I.e., the result computed in floating point arithmetic (f̂(x)) equals the resultobtained when the mathematically exact function (f) is applied to slightlyperturbed data (x̄).

The difference between x̄ and x, is the perturbation to the original input x.