
I am doing a numerical analysis of some math software I developed. I want to identify the uncertainty of my result. With f() being my method and x an input value, I want to find the y such that my result is f(x) +/- y. My f() method performs multiple operations between float variables. To study the error propagation that occurs in f(), I have to apply the statistical propagation-of-uncertainty formulas, and to do so I need to know the uncertainty of a float variable.

I do understand the architecture of a float variable as specified in the IEEE 754 standard, and the rounding error inherent in converting a decimal value to a float.

From what I understood of the literature, the FLT_EPSILON macro documented at http://www.cplusplus.com/reference/cfloat/ defines my y value, but this quick test proves it wrong:

float f1 = 1.234567f;
float f2 = 1.234567f + 1.192092896e-7f;
float f3 = 1.234567f + 1.192092895e-7f;

printf("Inicial:\t%f\n", f1);
printf("Inicial:\t%f\n", f2);
printf("Inicial:\t%f\n\n", f3);

Output:

Inicial:  1.234567
Inicial:  1.234567
Inicial:  1.234567

Whereas the expected output should be:

Inicial:  1.234567
Inicial:  1.234568 <---
Inicial:  1.234567

What am I wrong about? Should the float values of x + FLT_EPSILON and x - FLT_EPSILON not be the same?

EDIT: My question is: with R being the float value of x, what is the y value such that x + y || x - y equals the same R float value?

Pedro Pereira
  • increase printf display precision, ex: `%.8f` – Jean-François Fabre Mar 23 '18 at 16:39
  • See my edit please – Pedro Pereira Mar 23 '18 at 16:45
  • Note that when the `float` values are passed to `printf()`, they are automatically converted to `double`. – Jonathan Leffler Mar 23 '18 at 16:52
  • @JonathanLeffler. Why does it matter? The addition happens first. – Mad Physicist Mar 23 '18 at 17:59
  • @MadPhysicist: At one level, it doesn't matter. At another level, that conversion is significant because it means you can use `double`-sized floating-point precision to look at the values passed to `printf()`; it could make sense to use `printf("%16.12f\n", f2);` to see more of the full precision of the value (because it is a `double` by the time it is processed by `printf()`). – Jonathan Leffler Mar 23 '18 at 18:03
  • @JonathanLeffler: The number of digits the C implementation can correctly produce is a function of (the quality of) the C implementation, not the specific floating-point type. In the various places the C standard specifies conversion to decimal, the standard recommends that results up to `DECIMAL_DIG` digits be correctly rounded, and `DECIMAL_DIG` is a function of the implementation’s widest supported floating-point type. So a `double`, for example, ought to be convertible with a number of digits sufficient to distinguish `long double` even though it is passed to `printf` as only a `double`. – Eric Postpischil Mar 23 '18 at 18:54

3 Answers


Propagation of uncertainty is from the field of statistics and refers to how uncertainties in inputs affect mathematical functions of them. The analysis of errors that occur in computational arithmetic is numerical analysis.

FLT_EPSILON is not a measure of uncertainty or error in floating-point results. It is the distance between 1 and the next value representable in the float type. Hence, it is the size of steps between representable numbers at the magnitude of 1.

When you convert a decimal numeral to floating-point, the rounding error that results may have a magnitude of up to ½ the step size when the common round-to-nearest mode is used. The reason the bound is ½ the step size is that for any number x (within the finite domain of the floating-point format), there is a representable value within ½ the step size (inclusive). This is because, if there is a representable number more than ½ the step size in one direction, there is a representable number less than ½ the step size in the other direction.

The step size varies with the magnitudes of the numbers. With binary floating-point, it doubles at 2, and again at 4, then 8, and so on. Below 1, it halves, and again at ½, ¼, and so on.

When you perform floating-point arithmetic operations, the rounding that occurs in the computation may compound or cancel previous errors. There is no general formula for the final error.

The two numerals you used in your sample code, 1.192092896e-7f and 1.192092895e-7f, are so close together that they convert to the same float value, 2^−23. That is why there is no difference between your f2 and f3.

There is a difference between f1 and f2, but you did not print enough digits to display it.

You ask “Should not the float value of x + FLT_EPSILON and x - FLT_EPSILON be the same?”, but your code does not contain x - FLT_EPSILON.

Re: “My question is being R the float value of x, what is the y value that x + y || x - y equals the same R float value?” This is trivially satisfied by y = 0. Did you mean to ask what is the largest value of y that satisfies the condition? That is a bit complicated.

The step size for a number x is called the ULP of x, which we may consider as a function ULP(x). ULP stands for Unit of Least Precision. It is the place value of the least digit in the floating-point representation of x. It is not a constant; it is a function of x.

For most values representable in a floating-point format, the largest y that satisfies your condition is ½ ULP(x) if the least digit in the floating-point representation of x is even, and just under ½ ULP(x) if the digit is odd. This complication arises from the rule that the results of arithmetic are rounded to the nearest representable value and, in case of a tie, the value with the even low digit is chosen. Thus, adding ½ ULP(x) to x yields a tie that rounds to x if the low digit is even, but does not round to x if the low digit is odd.

However, for x that are on the boundary where the ULP changes, the largest y that satisfies your condition is ¼ ULP(x). This is because, just below x (in magnitude), the step size changes, and the next number lower than x is half of x’s step size away instead of the usual full step size. So you can only go halfway toward that value before changing the result of the subtraction, so the most y can be is ¼ ULP(x).

Eric Postpischil
  • In `float.h`: `#define FLT_EPSILON 1.192092896e-7f`. `"Smallest such that 1.0+FLT_EPSILON !=1.0"` Taking that into account `1.0+1.192092895e-7f==1.0` should be true. Also I just noticed the float values that I was adding are wrong, they should be `1.192092896e-7f` and `1.192092895e-7f` respectively. – Pedro Pereira Mar 23 '18 at 17:17
  • Ah, I see, thanks for the explanation. But what is the function of the ULP? – Pedro Pereira Mar 23 '18 at 17:22
  • @PedroPereira: First, `1.192092895e-7f` is not smaller than `FLT_EPSILON` and is not smaller than `1.192092897e-7f`. In source code, the string “1.192092895e-7f” is a literal representing a floating-point value. The value it represents is not the mathematical number 1.192092895*10^-7 but is the result of converting that number to the `float` format. In that conversion, the number is rounded to the nearest representable value. Because it is so close to the nearest representable value, which is 1.1920928955078125*10^-7, it is rounded to that value, which equals `FLT_EPSILON`. – Eric Postpischil Mar 23 '18 at 17:38
  • @PedroPereira: Second, “Smallest such that 1.0+FLT_EPSILON !=1.0” is not a correct description of `FLT_EPSILON`, and I do not know where that quote comes from. A correct description of `FLT_EPSILON` is “the difference between 1 and the least value greater than 1 that is representable in the given floating point type,” with the floating point type being `float` in this case. That comes from the C standard. This implies that `1 + FLT_EPSILON != 1`, but it is not the smallest such value that accomplishes that, because you can have x = .75 * `FLT_EPSILON`, and then `1+x != 1` due to rounding. – Eric Postpischil Mar 23 '18 at 17:41
  • @PedroPereira: What is true is that if `x1` is the smallest value larger than 1 that results from `1+x` for any `x`, then `x1-x` equals `FLT_EPSILON`. – Eric Postpischil Mar 23 '18 at 17:43
  • @PedroPereira: For IEEE-754 basic 32-bit binary floating-point, ULP(x) is the greater of 2^(floor(log2(|x|)-23)) and 2^-149, provided x is less than 2^128. – Eric Postpischil Mar 23 '18 at 17:44
  • Thank you for your time, I think I understood what you are telling me. So I can not know the largest value I can add to a number that does not change it because the ULP function is not linear. Right? Keeping that in mind how can I study the numerical analysis of my methods if I can not put a number on the uncertainty of float type values? – Pedro Pereira Mar 23 '18 at 17:51
  • I feel like I am a step back (or multiple) than the point I started – Pedro Pereira Mar 23 '18 at 17:52
  • @PedroPereira: Numerical analysis is a subject of considerable size. There are textbooks and research papers on it. There is no general answer. You would have to describe your software in some detail in order for anybody to analyze its numerical properties. Sometimes people do statistical estimates of error by varying the inputs slightly and seeing how much the results vary and/or comparing them with results calculated with more precision. These do not amount to proofs but can give ideas about the nature of errors in particular algorithms. – Eric Postpischil Mar 23 '18 at 18:08
  • That was the method I was following to measure the error but my dissertation professor (I do not know the term in english) wants me to be as scientifically thorough as possible. Long story short, I need to do this analysis to prove him what method is better for the numerical analysis, if a practical approach or a theoretic one. – Pedro Pereira Mar 23 '18 at 18:18
float f1 = 1.234567f;
float f2 = f1 + 1.192092897e-7f;
float f3 = f1 + 1.192092896e-7f;

printf("Inicial:\t%.20f\n", f1);
printf("Inicial:\t%.20f\n", f2);
printf("Inicial:\t%.20f\n\n", f3);

Output:

Inicial:        1.23456704616546630000
Inicial:        1.23456716537475590000
Inicial:        1.23456716537475590000

No, your expectation is wrong. In the first printf call, you're printing the variable f1 with nothing added to it, which is just 1.234567f.

Beyondo

Float is a 32-bit IEEE 754 single-precision floating-point number: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand (plus an implicit leading bit), i.e. float has about 7 decimal digits of precision.

Increase the number of digits printf prints to see more, but after about 7 significant digits it's just noise:

#include <stdio.h>

int main(void) {

 float f1 = 1.234567f;
 float f2 = 1.234567f + 1.192092897e-7f;
 float f3 = 1.234567f + 1.192092896e-7f;

 printf("Inicial:\t%.16f\n", f1);
 printf("Inicial:\t%.16f\n", f2);
 printf("Inicial:\t%.16f\n\n", f3);

 return 0;
}

Output:

Inicial:        1.2345670461654663
Inicial:        1.2345671653747559
Inicial:        1.2345671653747559
sg7