11

Assume that `t`, `a`, and `b` are all double (IEEE Std 754) variables, and that neither a nor b is NaN (though they may be Inf). After `t = a - b`, do I necessarily have `a == b + t`?

updogliu
  • I believe the result of an underflow there would be undefined, and so would that of an overflow in the second expression, so no. If someone could confirm that, it'd be nice. – chris May 29 '12 at 00:52
  • Ah, I guess this kind of confirms that overflow is undefined for floating-point too: `As with any other arithmetic overflow, if the result does not fit in the space provided, the behavior is undefined.` – chris May 29 '12 at 00:58
  • 3
    In a C implementation conforming to IEEE 754, there is no UB for any floating point arithmetic. All results are strictly defined. – R.. GitHub STOP HELPING ICE May 29 '12 at 01:00
  • @R.., thanks. I don't have that particular version available and that text is what was in mine. – chris May 29 '12 at 01:02
  • 3
    If a=b=Inf then t=NaN and the second equation fails. You can probably find another example using negative zero. – Raymond Chen May 29 '12 at 01:06
  • C allows a lot (some would say *complete*, i.e. results can be completely meaningless) leeway in what floating point does in a non-IEEE-conforming implementation. In reality, most real-world implementations that people care about purport to conform to IEEE 754. The text governing normative behavior for C implementations aiming to support IEEE arithmetic is in Annex F of the C standard. – R.. GitHub STOP HELPING ICE May 29 '12 at 01:07
  • 2
    @RaymondChen: Negative zero is a great example too, but see my answer for some less obscure cases... – R.. GitHub STOP HELPING ICE May 29 '12 at 01:07

3 Answers

26

Absolutely not. One obvious case is a=DBL_MAX, b=-DBL_MAX. Then t=INFINITY, so b+t is also INFINITY, while a is finite.

What may be more surprising is that there are cases where this happens without any overflow. Basically, they're all of the form where a-b is inexact. For example, if a is DBL_EPSILON/4 and b is -1, a-b is 1 (assuming default rounding mode), and a-b+b is then 0.

The reason I mention this second example is that this is the canonical way of forcing rounding to a particular precision in IEEE arithmetic. For instance, if you have a number in the range [0,1) and want to force rounding it to 4 bits of precision, you would add and then subtract 0x1p49.

R.. GitHub STOP HELPING ICE
  • 1
    The 2nd example is great as it doesn't incur Inf nor NaN. Thanks a lot. – updogliu May 29 '12 at 01:50
  • 1
    You might want to clarify the `0x1p49` constant, the last time I looked hex digits ran from 0 to F ;) – MSalters May 29 '12 at 08:22
  • 8
    @MSalters: "0x1p49" is hexadecimal floating-point, as defined in the C standard. The format is "0x", a significand, "p", and an exponent, where the significand is a hexadecimal numeral, optionally including a period, and the exponent is a decimal numeral, optionally including a sign. The base for the exponent is two, so 0x1p49 is 2**49. 0x1p-4 would be 1/16, and 0x1.23p8 would be (1 + 2/16 + 3/256) * 2**8 = 291. Hexadecimal floating-point provides a format that is easy for humans and compilers to convert to and from binary floating-point encodings without rounding issues. – Eric Postpischil May 29 '12 at 12:46
1

In the process of doing the first operation, bits could have been lost off the low end of the result. So one question is, will the second operation exactly reproduce those losses? I haven't fully thought that out.

But, of course, the first operation could have overflowed to +/-infinity, rendering the second compare unequal.

(And, of course, in the general case using == for floating-point values is almost always a bug.)

Hot Licks
  • 1
    Just by a counting argument, the second operation can't bring back what was lost. If it could, you'd be storing more bits of information in `t` than the number of bits in `t`... – R.. GitHub STOP HELPING ICE May 29 '12 at 01:23
  • @R -- Yes. Intuitively one knows that it won't work, because of what you say, but finding examples is better "proof" than appealing to an esoteric rule, no matter how valid. – Hot Licks May 29 '12 at 01:25
-3

You are not guaranteed anything when using floats. If the exponent is different for both numbers, the result of an arithmetic operation may not be completely representable in a float.

Consider this code:

float a = 0.003f;
float b = 10000000.0f;
float t = a - b;
float x = b + t;

Running on Visual Studio 2010, you get t==-10000000.0f, and therefore x==0.

You should never use equality when comparing floats. Instead, compare the absolute value of the difference between the two values against an epsilon small enough for your precision needs.

It gets even weirder as different floating point implementations may return different results for the same operation.

user1003819
  • 5
    I never liked the "compare the absolute value of the difference" advice. It is possible to get bounds on errors ([What Every Computer Scientist Should Know About Floating-Point Arithmetic](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) is a good start) and one should think about what one is trying to accomplish with the comparison before blindly switching to some arbitrary bound. – Joshua Green May 29 '12 at 03:08
  • 8
    There are plenty of things that are guaranteed when using IEEE-754 floats. This happens to not be one of them. – Stephen Canon May 29 '12 at 03:10
  • There are lots of guarantees when using IEEE floats, and there are times when comparing for equality is not only reasonable, but essential. Floating-point math is definitely tricky, but it isn't random or malicious. Here's an example from my blog of when testing for floating-point equality is critical: https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/ – Bruce Dawson Feb 23 '15 at 17:21