In my computer science course, we are doing a study of floating point numbers and how they are represented in memory. I already understand how they are represented in memory (the mantissa/significand, the exponent and its bias, and the sign bit), and I understand how floats are added and subtracted from each other (denormalization and all of that fun stuff). However, while looking over some study questions, I noticed something that I cannot explain.
When a float that cannot be represented exactly is added to itself several times, the result is slightly lower than we would expect mathematically, but when that same float is multiplied by an integer, the result comes out to exactly the correct number.
Here is an example from our study questions (the example is written in Java, and I have edited it down for simplicity):
float max = 10.0f; /* Defined outside the function in the original code */
float min = 1.0f; /* Defined outside the function in the original code */
int count = 10; /* Passed to the function in the original code */
float width = (max - min) / count;
float p = min + (width * count);
In this example, we are told that the result comes out to exactly 10.0. However, if we look at this problem as a sum of floats, we get a slightly different result:
float max = 10.0f; /* Defined outside the function in the original code */
float min = 1.0f; /* Defined outside the function in the original code */
int count = 10; /* Passed to the function in the original code */
float width = (max - min) / count;
for (float p = min; p <= max; p += width) {
    System.out.printf("%f%n", p);
}
We are told that the final value of p in this test is ~9.999999, with a difference of -9.536743E-7 between the last value of p and the value of max. From a logical standpoint (knowing how floats work), this value makes sense.
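For reference, here is a minimal sketch of how I checked the drift in the loop. The class and method names (LoopDrift, lastLoopValue) are my own and are not part of the study questions; the sketch just re-runs the same loop and prints the exact value of the last p along with the float spacing (ulp) at that magnitude.

```java
import java.math.BigDecimal;

// Hypothetical helper class, not part of the original study questions.
class LoopDrift {
    // Runs the same loop as the question and returns the last value of p
    // that passed the p <= max check.
    static float lastLoopValue(float min, float max, int count) {
        float width = (max - min) / count;
        float last = min;
        for (float p = min; p <= max; p += width) {
            last = p;
        }
        return last;
    }

    public static void main(String[] args) {
        float max = 10.0f;
        float last = lastLoopValue(1.0f, max, 10);
        // new BigDecimal(double) shows the exact binary value;
        // the float widens to double without loss.
        System.out.println("last p       = " + new BigDecimal(last).toPlainString());
        System.out.println("last p - max = " + (last - max));
        // Math.ulp gives the spacing between adjacent floats at this magnitude.
        System.out.println("ulp(last p)  = " + Math.ulp(last));
    }
}
```

With the values from the question, the difference and the ulp have the same magnitude, so the accumulated p appears to be the float immediately below 10.0.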
The thing that I do not understand, though, is why we get exactly 10.0 in the first example. Mathematically, it makes sense that we would get 10.0, but knowing how floats are stored in memory, it does not make sense to me. Could anyone explain why we get an exact value when multiplying an imprecise float by an int?
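To make the question concrete, here is a small sketch that prints the exact intermediate values of the first example. The class and helper names (ProductProbe, computeWidth, exact) are hypothetical and only for illustration. It also computes the same product in double so the unrounded value is visible. My reading of the output is that the float product gets rounded to exactly 9.0, but I do not know whether that is guaranteed or just a coincidence of these particular values.

```java
import java.math.BigDecimal;

// Hypothetical helper class, not part of the original study questions.
class ProductProbe {
    // Exact decimal expansion of a binary floating-point value.
    static String exact(double value) {
        return new BigDecimal(value).toPlainString();
    }

    static float computeWidth(float min, float max, int count) {
        return (max - min) / count;
    }

    public static void main(String[] args) {
        float min = 1.0f;
        float max = 10.0f;
        int count = 10;

        float width = computeWidth(min, max, count);
        // count is promoted to float, and the result is rounded to float.
        float product = width * count;
        // The same product in double, which has enough bits to hold it exactly.
        double wideProduct = (double) width * count;
        float p = min + product;

        System.out.println("width              = " + exact(width));
        System.out.println("width * count      = " + exact(product));
        System.out.println("same, in double    = " + exact(wideProduct));
        System.out.println("min + width*count  = " + exact(p));
    }
}
```

Here width prints as slightly less than 0.9, the double product prints as slightly less than 9, and the float product prints as 9.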
EDIT: To clarify, in the original study questions, some of the values are passed to the function and others are declared outside of the function. My code examples are shortened and simplified versions of the study question examples. Because some of the values are passed into the function rather than being explicitly defined as constants, I believe simplification/optimization at compile time can be ruled out.