5

Yesterday I asked a question about why I was losing accuracy in a floating point computation. I received an answer explaining that it was due to intermediate results being held in x87 registers. That was helpful, but some of the details still escape me. Here's a variation of the program I presented in the previous question; I'm using VC++ 2010 Express in debug mode.

#include <stdio.h>

int main()
{
    double x = 1.8939201459282359e-308; /* subnormal number */
    double tiny = 4.9406564584124654e-324; /* smallest positive IEEE double, 2^-1074 */
    double scale = 1.6;
    double temp = scale*tiny;
    printf("%23.16e\n", x + temp);
    printf("%23.16e\n", x + scale*tiny);
}

This outputs

1.8939201459282369e-308
1.8939201459282364e-308

The first value is correct according to the IEEE standard. Giving the scale variable a value of 2.0 gives the correct value for both computations. I understand that temp in the first calculation is a subnormal value and therefore loses precision. I also understand that the value of scale*tiny is held in an x87 register, which has a greater exponent range, so this value has more precision than temp. What I don't understand is why adding the value to x gives the correct answer from the lower-precision value. Surely if a lower-precision value can give the correct answer, then a higher-precision value should also give the correct answer? Is this something to do with 'double rounding'?
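
For what it's worth, here is a sketch of the workaround I've been trying; the assumption (mine, not verified against the standard) is that storing through a volatile double forces the x87 register value to be rounded to a true 53-bit double first, so both additions then agree:

#include <stdio.h>

int main()
{
    double x = 1.8939201459282359e-308;
    double tiny = 4.9406564584124654e-324;
    double scale = 1.6;
    volatile double forced = scale*tiny; /* assumption: the store rounds the product to a real double */
    printf("%23.16e\n", x + forced);     /* should match the 'temp' computation */
}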

Thanks in advance; this is a whole new subject for me, so I'm struggling a little.

john
  • The following might well be true, but it is not at all obvious to me: *Surely if a lower precision value can give the correct answer then a higher precision value should also give the correct answer?* – NPE Mar 16 '13 at 15:18
  • If I were you, I would use `long double` in such calculations... – Rontogiannis Aristofanis Mar 16 '13 at 15:38
  • How do we know that the lower precision number doesn't have a random value in the last digit? There is always a 10% chance of hitting the expected one. – Bo Persson Mar 16 '13 at 15:51
  • @RondogiannisAristophanes My desire is to understand what is happening. – john Mar 16 '13 at 15:56
  • @BoPersson Your comment baffles me, there are no random digits, everything is determined. Plus IEEE-754 floating point is binary not decimal. – john Mar 16 '13 at 15:58
  • I'm just saying that if they are both "wrong" (=outside of the precision), one of them could still be the one you expected. – Bo Persson Mar 16 '13 at 16:01
  • The one I was expecting is the IEEE-754 result. Until very recently I was under the impression that my platform supported IEEE-754 (it does, just not by default). – john Mar 16 '13 at 16:11

1 Answer

7

The point is that due to the larger exponent range, the two numbers are not subnormal in the x87 representation.

In the IEEE754 representation,

x    = 0.d9e66553db96f × 2^(-1022)
tiny = 0.0000000000001 × 2^(-1022)

but in the x87 representation,

x    = 1.b3cccaa7b72de × 2^(-1023)
tiny = 1.0000000000000 × 2^(-1074)
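
If you want to see these encodings for yourself, here is an illustrative sketch (not part of the original argument; it assumes little-endian x86, a 64-bit unsigned long long, and printf support for %llx). A zero exponent field marks an IEEE754 subnormal:

#include <stdio.h>
#include <string.h>

int main()
{
    double x    = 1.8939201459282359e-308;
    double tiny = 4.9406564584124654e-324;
    unsigned long long bx, bt;
    memcpy(&bx, &x, sizeof bx);      /* view the IEEE754 bits without aliasing trouble */
    memcpy(&bt, &tiny, sizeof bt);
    printf("x    = %016llx\n", bx);  /* 000d9e66553db96f: exponent field 0 => subnormal */
    printf("tiny = %016llx\n", bt);  /* 0000000000000001: smallest positive subnormal */
}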

Now, when 1.6*tiny is computed in the IEEE754 representation, it is rounded to 0.0000000000002 × 2^(-1022) since that is the closest representable number to the mathematical result. Adding that to x results in

  0.d9e66553db96f × 2^(-1022)
+ 0.0000000000002 × 2^(-1022)
-----------------------------
  0.d9e66553db971 × 2^(-1022)
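
As a cross-check (my illustration, not something the argument depends on): both operands and the sum are subnormal, i.e. plain multiples of 2^(-1074), so this IEEE754 addition is just integer addition on the bit patterns:

#include <stdio.h>
#include <string.h>

int main()
{
    /* subnormals are f × 2^(-1074) with f the raw 52-bit fraction field,
       so x + temp is 0x000d9e66553db96f + 0x2 on the bit patterns */
    unsigned long long bits = 0x000d9e66553db96fULL + 0x2ULL;
    double sum;
    memcpy(&sum, &bits, sizeof sum);
    printf("%23.16e\n", sum);        /* 1.8939201459282369e-308 */
}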

But in the x87 representation, 1.6*tiny becomes

1.999999999999a × 2^(-1074)

and when that is added

  1.b3cccaa7b72de × 2^(-1023)
+ 0.00000000000033333333333334 × 2^(-1023)
------------------------------------------
  1.b3cccaa7b72e13333333333334 × 2^(-1023)

the result rounded to 53 significant bits is

  1.b3cccaa7b72e1 × 2^(-1023)

with the last bit of the significand being 1. If that is then converted to the IEEE754 representation (where it can have at most 52 bits in the significand, because it is a subnormal number), it falls exactly halfway between the two adjacent representable numbers 0.d9e66553db970 × 2^(-1022) and 0.d9e66553db971 × 2^(-1022), and by default it is rounded to the one whose last significand bit is zero.
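
You can reproduce that final tie-to-even rounding in isolation: 1.b3cccaa7b72e1 × 2^(-1023) is the 53-bit integer 0x1b3cccaa7b72e1 scaled by 2^(-1075), so a single correctly rounded denormalization lands exactly on the halfway case. A sketch, assuming the C library's ldexp rounds correctly to nearest-even as IEEE754 requires:

#include <stdio.h>
#include <math.h>

int main()
{
    double sig = (double)0x1b3cccaa7b72e1LL; /* 53-bit integer, exactly representable */
    double r = ldexp(sig, -1075);            /* one rounding step into the subnormal range;
                                                assumes ldexp is correctly rounded */
    printf("%23.16e\n", r);                  /* ties-to-even gives 1.8939201459282364e-308 */
}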

Note that if the FPU were not configured to use only 53 bits for the significand, but the full 64 of the x87 extended precision type, the result of the addition would be closer to the IEEE754 result 0.d9e66553db971 × 2^(-1022) and hence rounded to that.
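
On MSVC you can observe this by switching the x87 precision control; a sketch using _controlfp_s from <float.h> (this applies to 32-bit x87 code only, and whether scale*tiny actually stays in a register still depends on how the compiler translates the expression):

#include <stdio.h>
#include <float.h>

int main()
{
    double x = 1.8939201459282359e-308;
    double tiny = 4.9406564584124654e-324;
    double scale = 1.6;
    unsigned int prev;
    _controlfp_s(&prev, _PC_64, _MCW_PC); /* full 64-bit x87 significand */
    printf("%23.16e\n", x + scale*tiny);  /* expect the IEEE754 result ...369e-308 */
    _controlfp_s(&prev, _PC_53, _MCW_PC); /* back to the VC++ default of 53 bits */
    printf("%23.16e\n", x + scale*tiny);  /* expect ...364e-308 again */
}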

Effectively, since the x87 representation has a larger exponent range, the significands of IEEE754-subnormal numbers retain more significant bits than in the IEEE754 representation, even with the significand width restricted to 53 bits. Thus the result of the computation here has one more significant bit in x87 than in IEEE754.

Daniel Fischer
  • Thanks Daniel, a worked example was **really** what I needed. So when 1.b3cccaa7b72e1 × 2^(-1023) gets converted back to IEEE-754 it's rounded down to 0.d9e66553db970 × 2^(-1022) instead of up to 0.d9e66553db971 × 2^(-1022)? What is the rounding mode for this operation in general? – john Mar 16 '13 at 16:08
  • Right. (Though I don't know whether it is rounded to IEEE754 for the `printf` at all, `printf` might also use the x87 representation.) The default rounding mode in IEEE754 is round-ties-to-even, i.e. last bit of the significand zero. – Daniel Fischer Mar 16 '13 at 16:12
  • 1
    Hi Daniel, a minor remark: the way you describe addition in the x87, near “due to the restriction of significant bits, it becomes 0.0000000000003 × 2^(-1023)” sounds like Cray addition (http://cs.nyu.edu/courses/fall03/G22.2420-001/lec4.pdf ). What the x87 does instead is conceptually the equivalent of computing the exact sum (1.b3cccaa7b72e13333333333334 × 2^(-1023)) and then rounding. – Pascal Cuoq Mar 16 '13 at 18:07
  • @PascalCuoq Thanks, wasn't sure how the x87 works exactly in that configuration. – Daniel Fischer Mar 16 '13 at 20:42