Introduction to floating point arithmetic
Matthias Petschow and Paolo Bientinesi
AICES, RWTH [email protected]
October 24th, 2013Aachen, Germany
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 1 / 21
“Disclaimer”
Muller et al. - Handbook ofFloating-Point Arithmetic -595 pages!
Many topics not covered in thislecture (e.g., hardware/softwareimplementation of FPA, languagesupport, cleverly using FPA)
This lecture: basics of floatingpoint representation and arithmetic
Goal: Make you aware of the issuesarising in finite precisioncomputations
See the references below for a morethorough treatment of the topic
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 2 / 21
References
Article: “What every computer scientist should know
about floating-point arithmetic”, by David Goldberg
Article: “Lecture Notes on the Status of IEEE
Standard 754 for Binary Floating-Point
Arithmetic”, by William Kahan
Book: “Accuracy and Stability of Numerical
Algorithms”, by Nick Higham
Book: “Numerical Computing with IEEE Floating
Point Arithmetic”, by Michael Overton
IEEE 754-1985 and IEEE 754-2008:“Standard for Floating-Point Arithmetic”
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 3 / 21
Numerical Representation
Numbers123
29= (first 40 digits)
4.241379310344827586206896551724137931034 . . .
π =3.141592653589793238462643383279502884197 . . .
In general: Infinite number of digits
Computers
Finite memory ⇒ Approximated numbers
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 4 / 21
Numerical Representation
Numbers123
29= (first 40 digits)
4.241379310344827586206896551724137931034 . . .
π =3.141592653589793238462643383279502884197 . . .
In general: Infinite number of digits
Computers
Finite memory ⇒ Approximated numbers
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 4 / 21
FP representation
Floating point system F\{0}
(−1)s × d0.d1d2d3 . . . dp−1 × βewith base β, precision p, and
s ∈ {0, 1}di ∈ {0, . . . , β − 1}d0 6= 0
emin ≤ e ≤ emax
IEEE-754 specifications (β = 2, d0 = 1)
Single: p = 24, emin = −126, emax = 127Double: p = 53, emin = −1022, emax = 1023Additionally: ±0, ±∞, subnormal numbers, NaNs
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 5 / 21
IEEE single precision
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 6 / 21
IEEE single precision
(−1)s × (1 + f)× 2E−127
f = d1 · 2−1 + d2 · 2−2 + d3 · 2−3 + . . .+ d23 · 2−23
e = E − 127, biased exponent 1 ≤ E ≤ 254
Special values for E = 0:F = 0 : ±0F 6= 0 : subnormal numbers with d0 = 0 and implicit
exponent e = −126
Special values for E = 255:F = 0 : ±∞F 6= 0 : NaNs
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 7 / 21
IEEE double precision
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 8 / 21
Example questions
Question 1What is the largest finite floating point number is IEEE singleprecision?
Question 2What is the smallest positive normalized floating point number isIEEE single precision? What is the smallest positive number?
Question 3How many normalized IEEE single precision numbers and how manysubnormal numbers are there?
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 9 / 21
Example questions
Question 4
Given IEEE single/double precision (p = 24, 53), what is the gapbetween 1 and the next larger number?
Question 5Between an adjacent pair of nonzero IEEE single precision realnumbers, how many IEEE double precision numbers are there?
Question 6What is the largest integer u such that all integer in the interval[−u, u] are exactly representable in IEEE single precision format?What is the corresponding u for double precision?
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 10 / 21
Representation error
Relative rounding error
Let x ∈ R and |x| ∈ [ωmin, ωmax], then
x̄ = x(1 + δ) where |δ| ≤ u
andx̄ = x/(1 + δ̃) where |δ̃| ≤ u
with u denoting the unit roundoff.
ωmin = smallest normalized positive floating point number
ωmax = largest normalized positive floating point number
x̄ = [x] = floating point representation of x
If rounded to nearest float, x̄ = RN(x) and u = 2−p
For other rounding modes, u = 2−p+1
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 11 / 21
Example questions
TaskGiven a base 2 floating point format with precision p, show that ifx ∈ R lies in the normalized range thanRN(x) = x(1 + δ), with |δ| ≤ 2−p,where RN() rounds to the nearest floating point number.
Question
What if RN(x) is subnormal? Find an example where the above isnot true.
Question
What if we truncate – that is round to zero RZ() – instead ofrounding to the nearest number RN()?
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 12 / 21
Finite precision arithmetic4-digit representation
Inexact Arithmetic
123.4 +.5678 =
123.9678 =
Truncated 123.9Rounded 124.0
Associativity?
No!
Exact arithmetic:(123.4 + .5678) + .5432 =123.4 + (.5678 + .5432)
Inexact arithmetic:(123.4 + .5678) + .5432 = 124.4123.4 + (.5678 + .5432) = 124.5
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 13 / 21
Finite precision arithmetic4-digit representation
Inexact Arithmetic
123.4 +.5678 =
123.9678 =
Truncated 123.9Rounded 124.0
Associativity?
No!
Exact arithmetic:(123.4 + .5678) + .5432 =123.4 + (.5678 + .5432)
Inexact arithmetic:(123.4 + .5678) + .5432 = 124.4123.4 + (.5678 + .5432) = 124.5
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 13 / 21
Finite precision arithmetic4-digit representation
Inexact Arithmetic
123.4 +.5678 =
123.9678 =Truncated 123.9
Rounded 124.0
Associativity?
No!
Exact arithmetic:(123.4 + .5678) + .5432 =123.4 + (.5678 + .5432)
Inexact arithmetic:(123.4 + .5678) + .5432 = 124.4123.4 + (.5678 + .5432) = 124.5
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 13 / 21
Finite precision arithmetic4-digit representation
Inexact Arithmetic
123.4 +.5678 =
123.9678 =Truncated 123.9
Rounded 124.0
Associativity?
No!
Exact arithmetic:(123.4 + .5678) + .5432 =123.4 + (.5678 + .5432)
Inexact arithmetic:(123.4 + .5678) + .5432 = 124.4123.4 + (.5678 + .5432) = 124.5
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 13 / 21
Finite precision arithmetic4-digit representation
Inexact Arithmetic
123.4 +.5678 =
123.9678 =Truncated 123.9
Rounded 124.0
Associativity?
No!
Exact arithmetic:(123.4 + .5678) + .5432 =123.4 + (.5678 + .5432)
Inexact arithmetic:(123.4 + .5678) + .5432 = 124.4
123.4 + (.5678 + .5432) = 124.5
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 13 / 21
Finite precision arithmetic4-digit representation
Inexact Arithmetic
123.4 +.5678 =
123.9678 =Truncated 123.9
Rounded 124.0
Associativity? No!
Exact arithmetic:(123.4 + .5678) + .5432 =123.4 + (.5678 + .5432)
Inexact arithmetic:(123.4 + .5678) + .5432 = 124.4123.4 + (.5678 + .5432) = 124.5
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 13 / 21
IEEE FP arithmetic
The standard floating point model
Given operation � ∈ {+,−,×, /} and x, y ∈ F0,
[x � y] = RN(x � y) = (x � y)(1 + δ)
with |δ| ≤ u, provided no underflow/overflow occurred.
RN() can be replaced by other rounding modes.
For RN() we have seen that u = 2−p.
Also, (x � y)(1 + δ) can be replaced by (x � y)/(1 + δ).
Similarly,√x = RN(x).
What about sin(x), cos(x), exp(x), log(x), ...?
Notation: Assuming a left-to-right evaluation, it holds
[x+ y + z/w] =[[
[x] + [y]]
+[[z]/[w]
]]Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 14 / 21
Example questions
Question 1
If x is a floating point number, is the floating point product [1× x]equal to x?
Question 2
If x 6= 0 is a (finite) floating point number, is the floating pointquotient [x/x] equal to 1?
Question 3If x is a floating point number, is the floating point product[0.5× x] equal to floating point quotient [x/2]?
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 15 / 21
Example questions
Question 4
Is it true that for all a, b ∈ R we have [a+ b] = [b+ a] and[a× b] = [b× a]?
Question 5
Let a = −1, b = 1, and c = 2−25. In IEEE single precisionarithmetic, what are the results of [a+ [b+ c]] and [[a+ b] + c]?
Question 6
Let a = b = 2513, and c = 2−1022. In IEEE double precisionarithmetic, what are the results of [a× [b× c]] and [[a× b]× c]?
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 16 / 21
Example question
Question 7
What is the result calling the function fun(2.0,0.0) defined below(using IEEE floating point values)?
fun(a, b) {res = 1/(1/a+ 1/b)return(res) }
Question 8What is a potential problem of the following program? How can itbe fixed?
hypot(a, b) {c = a2 + b2
c = sqrt(c)return(c) }
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 17 / 21
Example questions
Task
Calculating the roots of a quadratic polynomial with roots() asbelow can cause inaccurate results (e.g. a = c = 1, b = −108 fordouble precision). Write a code that avoids cancellation as much aspossible, e.g. by using r1r2 = c/a. Are there other potentialproblems, such as overflow etc.?
roots(a, b, c) {r1 = −b+
√b2−4ac2a
r2 = −b−√b2−4ac2a
return(r1, r2) }
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 18 / 21
Error Analysis
f : A→ B, y = f(x)
Es.: f(x) = x2 + sin(2 ∗ x)
x =π
123, f(x) =?
Exact arithmetic:( π
123
)2+ sin
(2 ∗ π
123
)= ...
Inexact arithmetic:
x x̂, f f̂ f̂(x̂) instead of f(x)
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 19 / 21
Error Analysis
f : A→ B, y = f(x)
Es.: f(x) = x2 + sin(2 ∗ x)
x =π
123, f(x) =?
Exact arithmetic:( π
123
)2+ sin
(2 ∗ π
123
)= ...
Inexact arithmetic:
x x̂, f f̂ f̂(x̂) instead of f(x)
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 19 / 21
Error Analysis
f : A→ B, y = f(x)
Es.: f(x) = x2 + sin(2 ∗ x)
x =π
123, f(x) =?
Exact arithmetic:( π
123
)2+ sin
(2 ∗ π
123
)= ...
Inexact arithmetic:
x x̂, f f̂ f̂(x̂) instead of f(x)
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 19 / 21
Example: Dot Product
x, y ∈ Rn; κ := xT y
κ :=((
(χ0ψ0 + χ1ψ1) + · · ·)
+ χn−2ψn−2
)+ χn−1ψn−1
κ̌ =
(((χ0ψ0(1 + ε
(0)∗ ) + χ1ψ1(1 + ε
(1)∗ ))(1 + ε
(1)+ ) + · · ·
)(1 + ε
(n−2)+ )
+χn−1ψn−1(1 + ε(n−1)∗ )
)(1 + ε
(n−1)+ )
=
n−1∑i=0
χiψi(1 + ε(i)∗ )
n−1∏j=i
(1 + ε(j)+ )
where ε
(0)+ = 0 and |ε(0)∗ |, |ε(j)∗ |, |ε(j)+ | ≤ u for j = 1, . . . , n− 1
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 20 / 21
Backward Stability
Let f : D → R be a map from the domain D to the range R.Let f̂ : D → R represent the execution in floating point arithmetic of a givenalgorithm A that computes f .
A is said to be backward stable iffor all x ∈ D there exists a perturbed input x̄ ∈ D, close to x, such thatf̂(x) = f(x̄).
I.e., the result computed in floating point arithmetic (f̂(x)) equals the resultobtained when the mathematically exact function (f) is applied to slightlyperturbed data (x̄).
The difference between x̄ and x, is the perturbation to the original input x.
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 21 / 21
Backward Stability
Let f : D → R be a map from the domain D to the range R.Let f̂ : D → R represent the execution in floating point arithmetic of a givenalgorithm A that computes f .
A is said to be backward stable iffor all x ∈ D there exists a perturbed input x̄ ∈ D, close to x, such thatf̂(x) = f(x̄).
I.e., the result computed in floating point arithmetic (f̂(x)) equals the resultobtained when the mathematically exact function (f) is applied to slightlyperturbed data (x̄).
The difference between x̄ and x, is the perturbation to the original input x.
Matthias Petschow (AICES, RWTH Aachen) Floating Point October 24th, 2013 21 / 21