Float type variable uncertainty

Question

I developed image processing software and need to do a numerical analysis of it, considering the error propagation through its operations and the uncertainty of float variables caused by the rounding inherent to this type.

According to the IEEE 754 standard, the machine epsilon for float variables is 1.19e-07. From what I understood, this value is the distance to the nearest representable float.

I tested whether this is true by adding the epsilon to a float value and checking whether x + epsilon == x. The outcome is not the same for every value in the float range, which is understandable, since large float values carry more uncertainty because of rounding and the limited number of bits used to represent them.

My question is: what is the uncertainty y associated with a float value x, such that (x + y) || (x - y) == x?

It might be my lack of knowledge of the English language, but I cannot seem to understand the literature on this topic.

Could someone explain, in as much detail as possible, the error in a simple operation such as the following?

float result = valA * 0.587f + valB * 0.331f;

If I knew the uncertainty of a float type variable, this error could be calculated simply with these formulas, right?

I had asked a question similar to this and a user provided me with a great resource. https://stackoverflow.com/questions/49471943/floating-point-arithmetic-varies-between-g-and-clang/49471977#49471977 This might be a good place to start. — 138, Mar 25 '18 at 19:35
The accuracy of the floating point types is relative to the magnitude of the values, as the accuracy is measured in number of digits. — Lasse V. Karlsen, Mar 25 '18 at 19:36
Those formulas look like they may be statistical or probability distribution formula. Analyzing computer arithmetic is numerical analysis, a different field. — Eric Postpischil, Mar 26 '18 at 00:54
Do you have any bounds on what `valA`, `valB`, and `result` can be? — Eric Postpischil, Mar 26 '18 at 01:41
@EricPostpischil These variables will belong to the integer interval `[0-255]`, but some of the constants used throughout the methods will have 3 decimals, like `0.587f` and `0.331f`. — Pedro Pereira, Mar 26 '18 at 08:32
@EricPostpischil Also, I read here (http://blog.reverberate.org/2014/09/what-every-computer-programmer-should.html) that we can normalize the precision of a float to a percentage instead of actual unit values. So according to this any float with 24 bit significand has an uncertainty of 0.00001% of its value. Can I make a numerical analysis of float values with this percentage uncertainty? — Pedro Pereira, Mar 26 '18 at 10:59

Eric Postpischil · Accepted Answer · 2018-03-26T23:21:29.713

Introduction

This answer presents an initial examination of the error in:

float result = valA * 0.587f + valB * 0.331f;

In this answer, values in the floating-point format and expressions computed with the floating-point format are written in code style, as in `z` or `x * y`. Mathematical variables and exact mathematical expressions are written without code style, as in z or x • y.

I assume that all arithmetic is done with IEEE-754 basic 32-bit binary floating-point. This format is commonly used for the float type, although some programming language implementations mix precisions, possibly using double or other precision while evaluating expressions of float type. I also assume all arithmetic is done using the round-to-nearest mode, with ties to the number with the even low bit.

This format has 24 bits in the significand, so the unit of least precision (ULP) is normally 2⁻²³ times the value of the most significant bit. This is the step size between representable values. For example, for values in [1, 2), the ULP is 2⁻²³. For values in [128, 256), the ULP is 2⁷•2⁻²³ = 2⁻¹⁶. (For subnormal values, the significand has fewer bits. The lowest the ULP can be is 2⁻¹⁴⁹. Beyond the largest finite representable value, the step size to the next representable value is infinite. However, in this question, only values of modest magnitude are involved, so we can neglect infinity.)
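The step size can be checked directly. The following is a minimal sketch and not part of the original answer: `ulp_of` is a hypothetical helper, and it assumes that `float` is IEEE-754 binary32 and that the argument is finite with a magnitude below `FLT_MAX`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: distance from |x| to the next larger representable
   float. Assumes IEEE-754 binary32 and a finite x with |x| < FLT_MAX. */
static float ulp_of(float x)
{
    uint32_t bits;
    float magnitude, next;

    memcpy(&bits, &x, sizeof bits);        /* read the bit pattern of x */
    bits &= 0x7FFFFFFFu;                   /* clear the sign bit */
    memcpy(&magnitude, &bits, sizeof magnitude);

    bits += 1u;                            /* adjacent encoding is the next float up */
    memcpy(&next, &bits, sizeof next);

    return next - magnitude;               /* a power of two, so the subtraction is exact */
}
```

With this helper, `ulp_of(1.0f)` is 2⁻²³ (the value of `FLT_EPSILON`), `ulp_of(200.0f)` is 2⁻¹⁶, and `ulp_of(0.587f)` is 2⁻²⁴.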

The result of computing any operation with correct rounding is at most ½ ULP away from the exact answer. That is, if we compute `z = x + y`, for example, the computed result `z` differs from the exact mathematical result z = x + y by at most ½ ULP of z. (Although z is an exact mathematical result with infinite precision, we use its magnitude to determine which range of the floating-point format it falls in, which determines what we mean by the ULP of z.) The reason the error is at most ½ ULP is that, if the two representable values nearest z are z0 and z1, we must have z0 ≤ z ≤ z1, and if ½ ULP < z1 − z, then z − z0 < ½ ULP (because z1 − z0 = 1 ULP, by definition of an ULP). Therefore, in choosing the nearest representable value, we pick the closer of z0 and z1, so the error never exceeds ½ ULP.
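The ½ ULP property can be observed for a single operation. The sketch below is an illustration with hypothetical helper names (`neighbor`, `abs_diff`, `product_is_nearest`). It uses multiplication rather than addition because the product of two 24-bit significands always fits exactly in a `double`, which gives an exact reference value. It assumes IEEE-754 binary32 and binary64, positive inputs, and no overflow or underflow.

```c
#include <stdint.h>
#include <string.h>

/* Returns the float whose encoding is `step` away from that of x.
   For positive finite x, step = +1 / -1 gives the next float up / down. */
static float neighbor(float x, int32_t step)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += (uint32_t)step;                /* unsigned wraparound handles -1 */
    memcpy(&x, &bits, sizeof x);
    return x;
}

static double abs_diff(double a, double b)
{
    return a > b ? a - b : b - a;
}

/* Checks that x * y, rounded to float, is at least as close to the exact
   product as either neighboring float, which implies an error of at most
   1/2 ULP. Assumes x, y > 0 and no overflow or underflow. */
static int product_is_nearest(float x, float y)
{
    volatile float z = x * y;              /* volatile: force rounding to float */
    double exact = (double)x * (double)y;  /* exact: 24 + 24 bits fit in 53 */
    double err = abs_diff((double)z, exact);

    return err <= abs_diff((double)neighbor(z, -1), exact)
        && err <= abs_diff((double)neighbor(z, +1), exact);
}
```

Under the stated assumptions, `product_is_nearest(255.0f, 0.587f)` should return 1, as should any other pair of positive inputs in the range this question is concerned with.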

As stated in a comment, `valA`, `valB`, and `result` are in [0, 256).

Symbolic Analysis

By the time we start computing `valA * 0.587f + valB * 0.331f`, `valA` and `valB` have some errors from previous operations. That is, ideally, using exact mathematics, we would have computed some numbers A and B, but instead the computer calculated `valA` and `valB`, and the differences are eA = `valA` − A and eB = `valB` − B.

Ideally, we would like to compute the number R such that R is, using exact mathematics, A • .587 + B • .331. When we use computer arithmetic:

`0.587f` will be converted from .587 to the floating-point format, and the result will have some rounding error e0, so the result is `0.587f` = .587 + e0.
`0.331f` will be converted from .331 to the floating-point format, and the result will have some rounding error e1, so the result is `0.331f` = .331 + e1.
`valA * 0.587f` will be computed with some error e2, so the result will be `valA * 0.587f` = `valA` • `0.587f` + e2.
`valB * 0.331f` will be computed with some error e3, so the result will be `valB * 0.331f` = `valB` • `0.331f` + e3.
The two products will be added, with some error e4, so the result will be `valA * 0.587f + valB * 0.331f` = `valA * 0.587f` + `valB * 0.331f` + e4, where the `+` inside code style is the rounded floating-point addition and the + outside it is exact.
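The errors e2, e3, and e4 from the steps listed above can be measured for a particular pair of inputs. The following is a sketch under stated assumptions, not the answer's method: `struct step_errors` and `measure_step_errors` are hypothetical names, `float` and `double` are assumed to be IEEE-754 binary32 and binary64, and the inputs are assumed to be of similar magnitude (such as pixel values in [0, 256)) so that every `double` reference value is exact.

```c
/* Rounding errors of the three arithmetic steps for one input pair. */
struct step_errors { double e2, e3, e4; };

static struct step_errors measure_step_errors(float valA, float valB)
{
    /* volatile forces each intermediate to be rounded to float and keeps
       the compiler from fusing a multiplication into the addition. */
    volatile float p = valA * 0.587f;
    volatile float q = valB * 0.331f;
    volatile float sum = p + q;
    struct step_errors e;

    /* Each reference value is computed in double from the float operands:
       the products need at most 48 bits, and the sum of two floats of
       similar magnitude also fits in 53 bits. */
    e.e2 = (double)p - (double)valA * (double)0.587f;
    e.e3 = (double)q - (double)valB * (double)0.331f;
    e.e4 = (double)sum - ((double)p + (double)q);
    return e;
}
```

For `valA` = `valB` = 1, both products are exact, so e2 and e3 are zero and only the final addition rounds.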

Now we can substitute the expressions, so:

`valA * 0.587f + valB * 0.331f` = (`valA` • `0.587f` + e2) + (`valB` • `0.331f` + e3) + e4.
`valA * 0.587f + valB * 0.331f` = (`valA` • (0.587 + e0) + e2) + (`valB` • (.331 + e1) + e3) + e4.
`valA * 0.587f + valB * 0.331f` = ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4.

With this, we have expressed the computed result, `valA * 0.587f + valB * 0.331f`, as an exact mathematical expression (of variables with incompletely known values), ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4.

Numerical Analysis

Next, we can place some bounds on the errors. e0 and e1 are easy: their magnitudes are at most ½ ULP of .587 and .331, respectively. .587 is in [½, 1), so its ULP is 2⁻²⁴, and .331 is in [¼, ½), so its ULP is 2⁻²⁵. So |e0| ≤ 2⁻²⁵, and |e1| ≤ 2⁻²⁶.

Bounds on e2 and e3 depend on the magnitudes of `valA * 0.587f` and `valB * 0.331f`. Since `valA` < 256, `valA * 0.587f` < 256, so its ULP is at most 2⁻¹⁶, and |e2| ≤ 2⁻¹⁷. With `valB`, we can see that `valB * 0.331f` < 128, so the ULP of `valB * 0.331f` is at most 2⁻¹⁷, and |e3| ≤ 2⁻¹⁸.

Finally, we have the error e4 that occurs in the final addition of `valA * 0.587f + valB * 0.331f`. We have assumed this is less than 256, so its ULP is at most 2⁻¹⁶, and |e4| ≤ 2⁻¹⁷.
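The bounds on e2, e3, and e4 can be checked empirically for the integer pixel values mentioned in the comments. This is a sketch that assumes integer inputs in [0, 255] and IEEE-754 binary32 and binary64 arithmetic. The names `magnitude` and `step_bounds_hold` are hypothetical. A passing check covers only these inputs and does not replace the argument above.

```c
/* Absolute value without pulling in <math.h>. */
static double magnitude(double v)
{
    return v < 0.0 ? -v : v;
}

/* Returns 1 if, for every integer pair (a, b) in [0, 255], the rounding
   errors satisfy |e2| <= 2^-17, |e3| <= 2^-18, and |e4| <= 2^-17. */
static int step_bounds_hold(void)
{
    int a, b;

    for (a = 0; a < 256; ++a) {
        for (b = 0; b < 256; ++b) {
            /* volatile forces rounding to float after every operation */
            volatile float p = (float)a * 0.587f;
            volatile float q = (float)b * 0.331f;
            volatile float sum = p + q;

            /* exact references: all of these fit in a double's 53 bits */
            double e2 = (double)p - (double)a * (double)0.587f;
            double e3 = (double)q - (double)b * (double)0.331f;
            double e4 = (double)sum - ((double)p + (double)q);

            if (magnitude(e2) > 0x1p-17 ||
                magnitude(e3) > 0x1p-18 ||
                magnitude(e4) > 0x1p-17)
                return 0;
        }
    }
    return 1;
}
```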

Looking at the mathematical expression of the computed result, ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4, we can see that the largest possible error occurs when e0, e1, e2, e3, and e4 have the greatest values (unless eA or eB is huge and negative, which we assume not to be true). So we can substitute the upper bounds we have prepared for these errors:

((A + eA) • (0.587 + 2⁻²⁵) + 2⁻¹⁷) + ((B + eB) • (.331 + 2⁻²⁶) + 2⁻¹⁸) + 2⁻¹⁷.

In the interests of time, I evaluated this with Maple. (It might be a bit more illuminating to expand the expression manually and retain some of the factors rather than consolidating coefficients into single numbers, but I leave that to the reader.) The result is:

2462056573/4194304000 • A + 2462056573/4194304000 • eA + 5/262144 + 2776629373/8388608000 • B + 2776629373/8388608000 • eB.

The ideal result is A • .587 + B • .331. When we subtract that from the above, the result is a bound on the error in the computation:

1/33554432 • A + 2462056573/4194304000 • eA + 5/262144 + 1/67108864 • B + 2776629373/8388608000 • eB.
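The manual expansion that the answer leaves to the reader can be sketched as follows. This is an added worked step, not Maple output. The last line assumes A + eA ≥ 0 and B + eB ≥ 0, so that the largest values of e0 and e1 give the largest result.

```latex
\begin{aligned}
\text{computed} - \text{ideal}
  &= \bigl((A + e_A)(0.587 + e_0) + e_2\bigr)
   + \bigl((B + e_B)(0.331 + e_1) + e_3\bigr) + e_4
   - (0.587\,A + 0.331\,B) \\
  &= (A + e_A)\,e_0 + (B + e_B)\,e_1
   + 0.587\,e_A + 0.331\,e_B + e_2 + e_3 + e_4 \\
  &\le 2^{-25} A + 2^{-26} B
   + (0.587 + 2^{-25})\,e_A + (0.331 + 2^{-26})\,e_B
   + 2^{-17} + 2^{-18} + 2^{-17}
\end{aligned}
```

Here 2⁻²⁵ = 1/33554432, 2⁻²⁶ = 1/67108864, and 2⁻¹⁷ + 2⁻¹⁸ + 2⁻¹⁷ = 5/262144, which are the coefficients of A and B and the constant term in the bound above.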

Since A < 256 and B < 256, we can substitute 256 for A and for B, yielding:

1/32768 + 2462056573/4194304000 • eA + 2776629373/8388608000 • eB.

Reversing a bit of Maple’s arithmetic, that is:

2⁻¹⁵ + (.587 + 2⁻²⁵) • eA + (.331 + 2⁻²⁶) • eB.

So, that is an upper bound on the error in `valA * 0.587f + valB * 0.331f`. It could possibly be reduced more with additional information about the relationship between `valA` and `valB`. Also, the errors in converting .587 and .331 to `float` are exactly known, so those should be used instead of the bounds I used as illustration in this answer.
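For comparison with the bound, the largest error that actually occurs can be measured when the inputs are exact integers (eA = eB = 0). This is a sketch with a hypothetical function name, `max_observed_error`. The ideal value is evaluated in `double`, whose own rounding error (about 1e-14 here) is assumed negligible next to the float errors being measured. If the compiler fuses a multiplication into the addition, one rounding step disappears, so the bound should still apply.

```c
/* Largest |computed - ideal| over all integer pairs (a, b) in [0, 255],
   with the inputs taken as exact (eA = eB = 0). */
static double max_observed_error(void)
{
    double worst = 0.0;
    int a, b;

    for (a = 0; a < 256; ++a) {
        for (b = 0; b < 256; ++b) {
            /* the expression from the question, rounded to float */
            volatile float result = (float)a * 0.587f + (float)b * 0.331f;

            /* ideal value A * .587 + B * .331, approximated in double */
            double ideal = (587.0 * a + 331.0 * b) / 1000.0;

            double err = (double)result - ideal;
            if (err < 0.0)
                err = -err;
            if (err > worst)
                worst = err;
        }
    }
    return worst;
}
```

The returned value should not exceed 2⁻¹⁵ (about 3.05e-05), the bound derived above for eA = eB = 0.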

One also needs to establish a lower bound on the error. The rounding errors could be negative, and we have to ask what the lowest possible value of ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4 is. As I am out of time for now, that is left for the reader.

Addendum

e0 is 13/1048576000. e1 is 1/4194304000. Then the upper bound on the error can be reduced to 731/32768000 + 4924113/8388608 • eA + 11106517/33554432 • eB, which is:

.731•2⁻¹⁵ + (.587 + .013•2⁻²⁰) • eA + (.331 + .001•2⁻²²) • eB.
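The exact conversion errors can be reproduced with integer arithmetic. The sketch below is one way to do it, with hypothetical names (`struct fraction`, `gcd64`, `conversion_error`). It assumes IEEE-754 binary32 and a compiler that rounds decimal literals correctly. The caller must pass a `shift` for which the constant times 2^shift is an integer.

```c
#include <stdint.h>

struct fraction { int64_t num, den; };

/* Greatest common divisor; b must be positive. */
static int64_t gcd64(int64_t a, int64_t b)
{
    if (a < 0)
        a = -a;
    while (b != 0) {
        int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Exact error c - num/den of a float constant c, as a reduced fraction.
   `shift` must make c * 2^shift an integer: 24 for a constant in
   [1/2, 1), 25 for one in [1/4, 1/2). */
static struct fraction conversion_error(float c, int64_t num, int64_t den, int shift)
{
    int64_t scale = (int64_t)1 << shift;
    /* scaling by a power of two is exact, so this recovers the significand */
    int64_t significand = (int64_t)(c * (float)scale);
    struct fraction e;
    int64_t g;

    e.num = significand * den - num * scale;   /* numerator of c - num/den */
    e.den = den * scale;
    g = gcd64(e.num, e.den);                   /* reduce to lowest terms */
    e.num /= g;
    e.den /= g;
    return e;
}
```

Calling `conversion_error(0.587f, 587, 1000, 24)` gives 13/1048576000, and `conversion_error(0.331f, 331, 1000, 25)` gives 1/4194304000, matching e0 and e1 above.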

Could not have asked for a more complete answer! Thank you for your time! — Pedro Pereira, Mar 26 '18 at 15:35
If anyone is trying to implement this (e.g. to calculate the propagation of numerical uncertainty in an algorithm) two points to note: 1. IEEE-754 double precision (64-bit) floating point numbers have 52 bits in the mantissa so use 52 in place of the 24 above when working with doubles; 2. From python 3.9 onward, there is a built in function to get the ULP of a number, x, which is math.ulp(x) so use that rather than writing your own (handles infinity, zero, etc.). See docs here: https://docs.python.org/3.9/library/math.html#math.ulp — Biggsy, Jun 15 '21 at 13:49
@Biggsy: The significand of an IEEE-754 “double” is 53 bits, not 52. 52 are stored in the primary significand field, and 1 is stored via the exponent field. Numerically, the value behaves as if it has a 53-bit significand. Also, “significand” is the preferred term for the fraction portion of a floating-point representation. “Mantissa” is an old term for the fraction portion of a logarithm. Significands are linear; mantissas are logarithmic. — Eric Postpischil, Jun 15 '21 at 13:54
