
In my computer science course, we are doing a study of floating point numbers and how they are represented in memory. I already understand how they are represented in memory (the mantissa/significand, the exponent and its bias, and the sign bit), and I understand how floats are added and subtracted from each other (denormalization and all of that fun stuff). However, while looking over some study questions, I noticed something that I cannot explain.

When a float that cannot be precisely represented is added to itself several times, the answer is lower than we would mathematically expect, but when that same float is multiplied by an integer, the answer comes out precisely to the correct number.

Here is an example from our study questions (the example is written in Java, and I have edited it down for simplicity):

float max = 10.0f; /* Defined outside the function in the original code */
float min = 1.0f; /* Defined outside the function in the original code */
int count = 10; /* Passed to the function in the original code */
float width = (max - min) / count;
float p = min + (width * count);

In this example, we are told that the result comes out to exactly 10.0. However, if we look at this problem as a sum of floats, we get a slightly different result:

float max = 10.0f; /* Defined outside the function in the original code */
float min = 1.0f; /* Defined outside the function in the original code */
int count = 10; /* Passed to the function in the original code */
float width = (max - min) / count;

for (float p = min; p <= max; p += width) {
    System.out.printf("%f%n", p);
}

We are told that the final value of p in this test is ~9.999999 with a difference of -9.536743E-7 between the last value of p and the value of max. From a logical standpoint (knowing how floats work), this value makes sense.

The thing that I do not understand, though, is why we get exactly 10.0 for the first example. Mathematically, it makes sense that we would get 10.0, but knowing how floats are stored in memory, it does not make sense to me. Could anyone explain why we get a precise and exact value by multiplying an imprecise float with an int?

EDIT: To clarify, in the original study questions, some of the values are passed to the function and others are declared outside of the function. My example codes are shortened and simplified versions of the study question examples. Because some of the values are passed into the function rather than being explicitly defined as constants, I believe simplification/optimization at compile time can be ruled out.
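To make the simplification concrete, here is a sketch of how the study-question code could be restructured so that all inputs arrive as parameters (the class and method names here are mine, not from the original study questions):

```java
public class WidthDemo {
    // All inputs arrive as parameters, so the values cannot be folded
    // to a constant at compile time by javac.
    static float endpoint(float min, float max, int count) {
        float width = (max - min) / count;
        return min + (width * count);
    }

    public static void main(String[] args) {
        // Even with run-time values, the result is still exactly 10.0
        System.out.println(endpoint(1.0f, 10.0f, 10)); // prints 10.0
    }
}
```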

Spencer D
  • Because the compiler reduced all of that to a constant value. Try making each statement a function and call them one after the other. – Amit Feb 19 '16 at 19:19
  • @Amit, my apologies, I should have made that clear in my question. Some of the values defined in the examples are passed in as variables to the function that computes the final result, so it would seem unlikely that it would be a compiler optimization. I was trying to simplify the code for this post, so I defined the values in the examples. I'll make an edit shortly to clarify that. – Spencer D Feb 19 '16 at 19:37
  • Unless you're about to surprise me with your edit, my comment (if you want I'll post it as an answer) will still hold. The compiler will optimize all the statements to the `max` value because all the statements do a back and forth calculation. – Amit Feb 19 '16 at 19:41
  • He can rule that out by inputting the numbers on the command line or from a file, so they're variables, not compile-time constants. – Rob11311 Feb 19 '16 at 19:48
  • I'm sure what they're trying to teach you is that floating point is broken and needs care, because you can't represent decimal fractions exactly in the base 2 floating point format. Avoiding 10 additions and doing 1 multiplication for better precision is the point. – Rob11311 Feb 19 '16 at 20:23
  • @Rob11311 - when you want to address someone in your comment, use the '@' sign and the username so they (me) are notified of the message. As you've already realized yourself by now, my point was accurate and inputting values from command line, file or even telepathy won't make a difference. – Amit Feb 19 '16 at 21:06

2 Answers


First, some nitpicking:

When a float that cannot be precisely represented

There is no "float that cannot be precisely represented." All floats can be precisely represented as floats.

is added to itself several times, the answer is lower than we would mathematically expect,

When you add a number to itself several times, you can actually get something higher than you might expect. I will use C99 hexfloat notation. Consider f = 0x1.000006p+0f. Then:

f+f = 0x1.000006p+1f
f+f+f = 0x1.800008p+1f
f+f+f+f = 0x1.000006p+2f
f+f+f+f+f = 0x1.400008p+2f
f+f+f+f+f+f = 0x1.80000ap+2f
f+f+f+f+f+f+f = 0x1.c0000cp+2f

However, 7.0*f = 0x1.c0000a8p+2 exactly, which rounds to 0x1.c0000ap+2f, less than f+f+f+f+f+f+f.
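Since the question is in Java, which also accepts hexadecimal float literals, you can check this sequence directly (the class name is mine):

```java
public class RepeatedAdd {
    public static void main(String[] args) {
        float f = 0x1.000006p0f;   // 1 + 6*2^-24
        float sum = 0.0f;
        for (int i = 0; i < 7; i++) {
            sum += f;              // seven additions, a rounding at each step
        }
        float prod = 7 * f;        // one multiplication, one rounding
        // Float.toHexString shows the bit-exact values
        System.out.println(Float.toHexString(sum));  // 0x1.c0000cp2
        System.out.println(Float.toHexString(prod)); // 0x1.c0000ap2
    }
}
```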

but when that same float is multiplied by an integer, the answer, comes out precisely to the correct number.

7 * 0x1.000006p+0f cannot be represented as an IEEE float. It therefore gets rounded. With the default rounding mode of round-to-nearest-with-ties-going-to-even, you get the closest float to your exact result when you do a single arithmetic operation like this.

The thing that I do not understand, though, is why we get exactly 10.0 for the first example. Mathematically, it makes sense that we would get 10.0, but knowing how floats are stored in memory, it does not make sense to me. Could anyone explain why we get a precise and exact value by multiplying an imprecise float with an int?

To answer your question, you get different results because you did different operations. It's a bit of a fluke that you got the "right" answer here.

Let's switch the numbers around. If I compute 0x1.800002p+0f / 3, I get 0x1.00000155555...p-1, which rounds to 0x1.000002p-1f. When I triple that, I get 0x1.800003p+0f, which rounds (since we break ties to even) to 0x1.800004p+0f. This is the same result as I'd get if I compute f+f+f in float arithmetic where f = 0x1.000002p-1f.
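The same round trip, written out in Java (class name mine):

```java
public class RoundTrip {
    public static void main(String[] args) {
        float x = 0x1.800002p0f;
        float third = x / 3;       // rounds to 0x1.000002p-1
        float back = 3 * third;    // rounds again; does NOT recover x
        System.out.println(Float.toHexString(third)); // 0x1.000002p-1
        System.out.println(Float.toHexString(back));  // 0x1.800004p0
        System.out.println(back == x);                // false
    }
}
```

So a single multiplication gives you the correctly rounded result of that one operation, which is not necessarily the number you started from.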

tmyklebu

Because 1.0 + ((10.0 - 1.0) / 10.0) * 10.0 performs only one calculation with inexact values, and thus incurs only one rounding error, it is more accurate than doing 10 additions of float's representation of 0.9. I think that is the principle this example is intended to teach.

The key issue is that 0.9 (like 0.1) cannot be represented exactly in binary floating point. So width carries a small error, and those errors add up in the loop.
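Here is a sketch of both computations side by side, using the question's values (the class name is mine):

```java
public class DriftDemo {
    public static void main(String[] args) {
        float min = 1.0f, max = 10.0f;
        int count = 10;
        float width = (max - min) / count; // nearest float to 0.9, slightly below it

        // Ten additions: up to ten rounding errors accumulate
        float sum = min;
        for (int i = 0; i < count; i++) {
            sum += width;
        }

        // One multiplication: a single rounding, which happens to land on 9.0f
        float prod = min + (width * count);

        System.out.println(sum);  // 9.999999
        System.out.println(prod); // 10.0
    }
}
```

width * count is exactly 8.99999976158142..., which is within half an ulp of 9.0, so the single rounding of the multiplication produces exactly 9.0f, and adding min gives exactly 10.0f.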

The "exact" number is probably displayed that way because of a clever output-formatting routine. When I first used computers, they loved to print such numbers in an absurd fixed-digit scientific format, which was not human friendly.
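You can see the formatting effect for yourself: Java's default %f output rounds the stored value to something that looks clean, while printing more digits, or the exact stored value via BigDecimal, exposes the error (class name mine):

```java
import java.math.BigDecimal;

public class ExactPrint {
    public static void main(String[] args) {
        float width = 9.0f / 10;             // the float nearest to 0.9
        System.out.printf("%f%n", width);    // 0.900000  (looks exact)
        System.out.printf("%.9f%n", width);  // 0.899999976
        // The BigDecimal(double) constructor preserves the stored value exactly
        System.out.println(new BigDecimal(width));
        // prints 0.89999997615814208984375
    }
}
```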

To understand what's going on, I'll find Koenig's Dr Dobbs blog post on this topic; it's an enlightening read. The series culminates by showing how languages like Perl, Python, and probably Java make calculations look exact if they're precise enough.

Koenig's Dr Dobbs article on floating point

Even Simple Floating-Point Output Is Complicated

Don't be too surprised if fixed-point arithmetic gets added to CPUs 5-10 years out; financial people like sums to be exact.

Rob11311
  • Definitely a helpful answer that might explain what is going on here. However, `width` (a value of `~0.9`) is multiplied by 10, not `min` (a value of `1.0`). Nonetheless, the blog post you linked to has left me with an interesting idea. When `width` is added to itself 10 times, no denormalization occurs because the exponent of `width` is obviously the same. Then, when that result is added to `min`, it is large enough that denormalization does not occur there either. Thus, there is not any precision loss, so the value of width is "*precise enough*" to be considered exact. – Spencer D Feb 19 '16 at 20:13
  • The compiler can simplify the expression as written away: you have a division by count, followed by a multiplication. Similarly, min + max - min can be reduced to `float p = max;`. Compilers are that clever these days. – Rob11311 Feb 19 '16 at 20:20
  • The hardware shuffles the numbers to scale them; as Amit pointed out, a clever compiler can detect that you are multiplying by the same value you divided by. The compiler will NOT want to do expensive conversions of 10 to 10.0f at runtime. So to test that theory, you need to input count at runtime as a float too. It ought to be more accurate than 10 additions, but it can't be reduced to `float p = max;` at compile time. And thanks for ticking the answer; you have to rush often to get in first, then improve the answer, or you find someone else duplicates it as you write. – Rob11311 Feb 19 '16 at 20:29
  • Ahh, that is actually a very good point. I had not considered the fact that when `p` is calculated, we actually end up with `min + ((max - min)/count) * count` (which as you pointed out, simplifies to `p = max`). Now that just seems obvious and I cannot believe I overlooked that xD Thank you for pointing that out. – Spencer D Feb 19 '16 at 20:31
  • Well, you're distracted. With experience, sometimes looking at assembler output to figure out what the C compiler really does, you learn it is sometimes surprisingly clever. I guess Java has some kind of readable output option too, though I never actually used it. – Rob11311 Feb 19 '16 at 20:39
  • It's not even when p is calculated; the problem is that the compiler builds a tree for expressions, analyses them, and can notice that it is re-multiplying immediately after the expression divides by the SAME value, so it may eliminate them, knowing they cancel out. It can notice the same by rearranging `min + (max - min)` to `max - min + min`. This is one reason why compiler writers like the freedom to re-order expressions and function parameters. The compiler can work with the names for the values; they're `identifiers`, so stored in the symbol table with attributes like their type, const-ness and actual value. – Rob11311 Feb 19 '16 at 20:42
  • This whole discussion is a very long and verbose repeat of what I wrote (first, since you already mentioned duplicates) in the initial comment. Your answer, while informative, is irrelevant to the issue. I *also* explained how this can be validated in my original comment. – Amit Feb 19 '16 at 21:02
  • @Amit, I understood what you were suggesting ("compiler reduced all of that to a constant value"), but the student obviously did not. It definitely was not immediately clear what you meant. Secondly, I was NOT addressing you specifically when I suggested he test by using run-time set values. Thirdly, that is NOT the point of the exercise he was set; that problem is probably a result of the rules for SO questions. – Rob11311 Feb 19 '16 at 22:51