
Here is the subtraction

First number

Decimal:      3.0000002
Hexadecimal:  0x40400001
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0001]

Subtract the second number:

Decimal:      3.0000000
Hexadecimal:  0x40400000
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0000]
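
A quick way to double-check both encodings is to reinterpret the float bits as an integer and print them in hex; the sketch below is only for verification (it uses memcpy to copy the raw bits, nothing from the calculation itself):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float a = 3.0000002f;   /* gets rounded to the nearest representable float */
    float b = 3.0f;
    uint32_t ba, bb;
    memcpy(&ba, &a, sizeof ba);   /* copy the raw bit pattern */
    memcpy(&bb, &b, sizeof bb);
    printf("a = %.10f  bits = 0x%08X\n", a, (unsigned)ba);  /* expect 0x40400001 */
    printf("b = %.10f  bits = 0x%08X\n", b, (unsigned)bb);  /* expect 0x40400000 */
    return 0;
}
```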

==========================================

In this situation the exponents are already the same, so we just need to subtract the mantissas. We know that in IEEE 754 there is a hidden 1 bit in front of the mantissa. Therefore, the mantissa subtraction (with the hidden bits included) is:

Mantissa_1[1100_0000_0000_0000_0000_0001] - Mantissa_2[1100_0000_0000_0000_0000_0000]

which equals

Mantissa_Rst = [0000_0000_0000_0000_0000_0001]

But this number is not normalized, because the leading (hidden) bit is not 1. Thus we shift Mantissa_Rst left 23 times, and the exponent decreases by 23 at the same time.
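
To illustrate, the significand subtraction and renormalization can be mimicked with plain integer arithmetic; this is just a sketch of the steps above, not how hardware actually does it:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* 24-bit significands with the hidden leading 1 made explicit */
    uint32_t sig_a = 0xC00001;   /* 1100_0000_0000_0000_0000_0001 */
    uint32_t sig_b = 0xC00000;   /* 1100_0000_0000_0000_0000_0000 */
    int exp = 1;                 /* both inputs have unbiased exponent +1 */

    uint32_t diff = sig_a - sig_b;   /* = 1 */
    /* normalize: shift left until bit 23 (the hidden 1) is set again */
    while (diff != 0 && (diff & 0x800000) == 0) {
        diff <<= 1;
        exp -= 1;
    }
    printf("significand = 0x%06X\n", (unsigned)diff);                /* 0x800000 after 23 shifts */
    printf("unbiased exponent = %d, biased field = %d\n", exp, exp + 127);  /* -22, 105 */
    return 0;
}
```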

Then we have the result value:

Hexadecimal: 0x34800000

Binary: Sign[0], Exponent[0110_1001], Mantissa[000_0000_0000_0000_0000_0000].

32 bits total, no rounding needed.

Notice that the mantissa field still has the implicit hidden 1 in front of it.

If my calculations are correct, converting the result to decimal gives 0.00000023841858. Compared with the true result 0.0000002, that still does not look very precise to me.
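
The whole subtraction can be checked directly; the sketch below just performs it in float and dumps the result's bits:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    float a = 3.0000002f;
    float b = 3.0f;
    float d = a - b;
    uint32_t bits;
    memcpy(&bits, &d, sizeof bits);   /* raw bit pattern of the difference */
    printf("difference = %.17g\n", d);               /* 2.384185791015625e-07 */
    printf("bits       = 0x%08X\n", (unsigned)bits); /* expect 0x34800000 */
    return 0;
}
```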

So the question is: are my calculations wrong, or is this a real situation that happens all the time in computers?

Shuaiyu Jiang
    The decimal number `3.0000002` can't be exactly represented in base 2, it will be rounded to the closest representable number. Convert it to double precision and output more digits, you'll see what I mean. – Mark Ransom Jul 28 '15 at 21:00
  • This happens all the time in the computer. If you want to see it, try calculating (1/3 + 1/3 + 1/3) == 1. It is because 0.0000002 cannot be represented exactly with 23 bits for the mantissa and 8 bits for the exponent – Alon Jul 28 '15 at 21:04
  • If you think this is bad, realize that this is _benign_ cancellation. There's a similar problem known as _catastrophic_ cancellation. – MSalters Jul 29 '15 at 00:26

2 Answers

4

The inaccuracy already starts with your input. 3.0000002 is a fraction with a prime factor of five in the denominator, so its "decimal" expansion in base 2 is periodic. No amount of mantissa bits will suffice to represent it exactly. The float you give actually has the value 3.0000002384185791015625 (this is exact). Yes, this happens all the time.
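
You can see this for yourself by printing the value with more digits than the default; a minimal sketch:

```c
#include <stdio.h>

int main(void) {
    float x = 3.0000002f;
    /* the float is promoted to double exactly, so all stored digits show up */
    printf("%.25f\n", x);   /* prints 3.0000002384185791015625000 */
    return 0;
}
```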

Don't despair, though! Base ten has the same problem (for example 1/3). It isn't a problem. Well, it is for some people, but luckily there are other number types available for their needs. Floating point numbers have many advantages, and a slight rounding error is irrelevant for many applications, for example when not even your inputs are perfectly accurate measurements of what you're interested in (a lot of scientific computing and simulation). Also remember that 64-bit floats exist.

Additionally, the error is bounded: with the best possible rounding, your result will be within 0.5 units in the last place (ULP) of the infinite-precision result. For a 32-bit float of the magnitude in your example, this is approximately 2^-25, or 3 * 10^-8. This gets worse as you do additional operations that have to round, but with careful numerical analysis and the right algorithms you can get a lot of mileage out of them.

  • So, were my hand calculation steps correct? I mean, with the hidden 1, I am not quite sure about my calculation steps. – Shuaiyu Jiang Jul 28 '15 at 21:29
  • @ShuaiyuJiang I didn't look too closely at your hex calculations, but I didn't see anything obviously wrong with them. – Mark Ransom Jul 28 '15 at 21:42
  • @ShuaiyuJiang I didn't redo the calculations for how much to shift, what the new exponent is, etc. but in the big picture it looks correct. –  Jul 28 '15 at 21:58
1

Whenever x/2 ≤ y ≤ 2x, the calculation x - y is exact which means there is no rounding error whatsoever. That is also the case in your example.
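
A small sketch of that property: convert both operands to double, where the exact difference is representable, and compare it against the float subtraction. The pair below is just an arbitrary example satisfying x/2 ≤ y ≤ 2x:

```c
#include <stdio.h>

int main(void) {
    float x = 0.7f, y = 0.6f;               /* y lies in [x/2, 2x] */
    double exact = (double)x - (double)y;    /* exact difference of the two stored values */
    double viaFloat = (double)(x - y);       /* float subtraction, then widened */
    printf("%d\n", exact == viaFloat);       /* prints 1: no rounding error in x - y */
    return 0;
}
```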

You just made the wrong assumption that you could have a floating point number that is equal to 3.0000002. You can't. The type "float" can only ever represent integers less than 2^24, multiplied by a power of two. 3.0000002 is not such a number, so it is rounded to the nearest floating point number, which is approximately 3.00000023841858. Subtracting 3 calculates the difference exactly and gives a result close to 0.00000023841858.
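
For instance, the 2^24 limit is easy to see directly; a throwaway sketch (values chosen arbitrarily):

```c
#include <stdio.h>

int main(void) {
    float big = 16777216.0f;          /* 2^24: above this, not every integer is representable */
    printf("%.1f\n", big + 1.0f);     /* prints 16777216.0: 16777217 is not representable */
    printf("%.1f\n", big + 2.0f);     /* prints 16777218.0: even integers still are */
    return 0;
}
```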

gnasher729