
I found Stevens Computing Services – K & R Exercise 2-1 to be a very thorough answer to K&R Exercise 2-1. This slice of the full code computes the maximum value of the `float` type in the C programming language.

Unfortunately, my theoretical understanding of float values is quite limited. I know they are composed of a significand (the mantissa) and a magnitude that is a power of 2.

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    float flt_a, flt_b, flt_c, flt_m;

    /* FLOAT */
    printf("\nFLOAT MAX\n");
    printf("<float.h> %E  ", FLT_MAX);

    flt_a = 2.0;
    flt_b = 1.0;
    while (flt_a != flt_b) {
        flt_m = flt_b;           /* MAX POWER OF 2 IN MANTISSA */
        flt_a = flt_b = flt_b * 2.0;
        flt_a = flt_a + 1.0;
    }
    flt_m = flt_m + (flt_m - 1); /* MAX VALUE OF MANTISSA */

    flt_a = flt_b = flt_c = flt_m;
    while (flt_b == flt_c) {
        flt_c = flt_a;
        flt_a = flt_a * 2.0;
        flt_b = flt_a / 2.0;
    }
    printf("COMPUTED %E\n", flt_c);
    return 0;
}
```

I understand that the latter part basically checks to which power of 2 the significand can be raised, using a three-variable algorithm. What about the first part?

I can see that a progression of powers of 2 should eventually determine the value of the significand, but when I traced a few small numbers to check how it should work, I failed to find the right values...

======================================================================

What are the concepts this program is based on, and does it get more precise as longer, non-integer numbers have to be found?

jww
maja
  • @TonyK, the `flt_m` computation is perfectly appropriate. It computes one less than twice `flt_m`, with care to avoid loss of precision. And there's nothing wrong with the style, given that it's written in K&R C. – John Bollinger Feb 09 '15 at 19:52
  • @JohnBollinger With care to avoid loss of precision‽ Absolutely not, `flt_m + flt_m - 1` would be the nearest float to 2 * `flt_m` - 1 for all values of `flt_m`. Instead, the assignment is written this way so as to be self-documenting (“MAX VALUE OF MANTISSA”). – Pascal Cuoq Feb 09 '15 at 20:09
  • @PascalCuoq, I suppose we're both speculating. However, I find the construction anything *but* self-documenting. The only reason for it that makes sense to me is to avoid any intermediate result (e.g. 2 * `flt_m`) that has worse than unit precision. It is plausible that having such a value as one input, an adder might convert the other to the same scale before computing their sum. An implementation that did so (without increasing the precision) would compute a result different from the desired one. – John Bollinger Feb 09 '15 at 20:46
  • @JohnBollinger in binary floating-point, `y + y` is exact unless it overflows (in which case `y + y - 1.0` is still the best approximation for the mathematical value unless the exponent range is non-standard). By contrast, `y + (y - 1.0)` gets some results horribly wrong in the binade in which the ULP is 2 (though that's not the case in this program) – Pascal Cuoq Feb 09 '15 at 21:10
  • @PascalCuoq, IEEE 754 arithmetic may guarantee such behavior, but C does not mandate that implementations comply with IEEE 754 (and indeed, some present and more past implementations don't). If we could assume IEEE 754, then what would be the point of the program? The characteristics of the IEEE 754 binary single-precision format are well known; we don't need to compute them. – John Bollinger Feb 09 '15 at 21:31
  • @JohnBollinger Well the program is an **exercise** in a **C-learning book**, so that may be its point. `flt_m + flt_m` is exact even if computed in higher precision, so surely `flt_m + (flt_m - 1)` is not written this way “with care to avoid loss of precision”. Unless you are claiming that all bets are off with respect to floating-point arithmetic, but then what makes you think that `flt_m + (flt_m - 1)` works any better than any alternative? – Pascal Cuoq Feb 09 '15 at 23:20
  • @PascalCuoq, you're right that `flt_m + (flt_m - 1)` is not guaranteed to avoid precision loss -- it protects only against one plausible precision-loss scenario for non-IEEE arithmetic. It nevertheless *does* protect against precision loss in that scenario, so the code presented works correctly in more environments than code that uses `2.0 * flt_m - 1.0`, even if it isn't guaranteed to work in every environment. That's still the best explanation I've considered for the *author's intent* in choosing that expression. I certainly don't buy self documentation, but I'd consider alternatives. – John Bollinger Feb 10 '15 at 17:37

1 Answer


The first loop determines the number of bits contributing to the significand by finding the least power of 2 such that adding 1 to it (using floating-point arithmetic) fails to change its value. If that's the nth power of two, then the significand uses n bits, because with n bits you can express all the integers from 0 through 2^n - 1, but not 2^n. The floating-point representation of 2^n must therefore have an exponent large enough that the (binary) units digit is not significant.

By that same token, having found the first power of 2 whose float representation has worse than unit precision, the maximum float value that does have unit precision is one less. That value is recorded in the variable flt_m.

The second loop then tests for the maximum exponent by starting with the maximum unit-precision value, and repeatedly doubling it (thereby increasing the exponent by 1) until it finds that the result cannot be converted back by halving it. The maximum float is the value before that final doubling.

Do note, by the way, that all the above supposes a base-2 floating-point representation. You are unlikely to run into anything different, but C does not actually require any specific representation.

With respect to the second part of your question,

does this program gets more precise as longer and non-integer numbers have to be found?

the program takes care to avoid losing precision. It does assume a binary floating-point representation such as you described, but it will work correctly regardless of the number of bits in the significand or exponent of such a representation. No non-integers are involved, but the program already deals with numbers that have worse than unit precision, and with numbers larger than can be represented with type int.

John Bollinger
  • Given `FLT_EVAL_METHOD` may have a value of 1 or 2, all `float` operations occur at `double` or `long double` precision/range failing OP's method. – chux - Reinstate Monica Feb 09 '15 at 20:17
  • @chux, even if the computations are performed at higher precision, that higher precision is lost when the result is assigned to a variable of type `float`. Implementations are required to behave as if that happens, even if they optimize away some of the actual assignments. – John Bollinger Feb 09 '15 at 20:24
  • C11 §5.2.4.2.2 9 " Except for assignment and cast (which remove all extra range and precision), ..." supports your insightful correction. – chux - Reinstate Monica Feb 09 '15 at 20:31