Floating point differences between 64 bit and 32 bit with Round

Question

I know all about the approximation issues with floating point numbers, so I understand how 4.5 can get rounded down to 4 if it was approximated as 4.4999999999999991. My question is why there is a difference between 32 bit and 64 bit when the same types are used.

In the code below I have two calculations. In 32 bit the value for MyRoundValue1 is 4 and the value for MyRoundValue2 is 5. In 64 bit they are both 4. Shouldn't the results be consistent with both 32 bit and 64 bit?

{$APPTYPE CONSOLE}
uses
  SysUtils; // needed for IntToStr
const
  MYVALUE1: Double = 4.5;
  MYVALUE2: Double = 5;
  MyCalc: Double = 0.9;
var
  MyRoundValue1: Integer;
  MyRoundValue2: Integer;
begin
  MyRoundValue1 := Round(MYVALUE1);
  MyRoundValue2 := Round(MYVALUE2 * MyCalc);
  WriteLn(IntToStr(MyRoundValue1));
  WriteLn(IntToStr(MyRoundValue2));
end.

score 8 · Accepted Answer · answered Jul 14 '15 at 19:53

On x87, which the 32 bit compiler targets, this code:

MyRoundValue2 := Round(MYVALUE2 * MyCalc);

is compiled to:

MyRoundValue2 := Round(MYVALUE2 * MyCalc);
0041C4B2 DD0508E64100     fld qword ptr [$0041e608]
0041C4B8 DC0D10E64100     fmul qword ptr [$0041e610]
0041C4BE E8097DFEFF       call @ROUND
0041C4C3 A3C03E4200       mov [$00423ec0],eax

The default control word for the x87 unit under the Delphi RTL performs calculations to 80 bit precision. So the floating point unit multiplies 5 by the closest 64 bit (double) value to 0.9, which is:

0.90000 00000 00000 02220 44604 92503 13080 84726 33361 81640 625

Note that this value is greater than 0.9. Multiplying it by 5 gives exactly 4.5 + 2^-53, which is representable at 80 bit precision, so the intermediate result stays greater than 4.5. Hence Round(MYVALUE2 * MyCalc) returns 5.

On 64 bit, the floating point math is done on the SSE unit, which does not use 80 bit intermediate values. There, 5 times the closest double to 0.9, rounded to double precision, is exactly 4.5. Round uses banker's rounding, so an exact 4.5 goes to the nearest even integer. Hence Round(MYVALUE2 * MyCalc) returns 4 on 64 bit.
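To check this on your own compiler, here is a minimal sketch that assumes the default x87 control word on Win32. The names AsExtended, AsDouble and Half are illustrative and not part of the original code. Storing the product to an Extended keeps the 80 bit result where Extended really is 80 bit, while storing to a Double rounds it. The last lines show how Round treats exact halves.

```pascal
{$APPTYPE CONSOLE}
const
  MYVALUE2: Double = 5;
  MyCalc: Double = 0.9;
var
  AsExtended: Extended; // 80 bit on Win32; an alias for Double on Win64
  AsDouble: Double;
  Half: Double;
begin
  AsExtended := MYVALUE2 * MyCalc; // keeps the extended result where Extended is 80 bit
  AsDouble := MYVALUE2 * MyCalc;   // rounded to double precision on store

  WriteLn('Extended product > 4.5: ', AsExtended > 4.5);
  WriteLn('Double product = 4.5: ', AsDouble = 4.5);

  // Round uses banker's rounding: exact halves go to the nearest even integer.
  Half := 4.5;
  WriteLn('Round(4.5) = ', Round(Half));
  Half := 5.5;
  WriteLn('Round(5.5) = ', Round(Half));
end.
```

On Win32 the first line should report TRUE, and on Win64 it should report FALSE because Extended is just Double there. The second line should report TRUE on both, and the two Round calls should give 4 and 6.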

You can persuade the 32 bit compiler to behave the same way as the 64 bit compiler by storing to a double rather than relying on intermediate 80 bit values:

{$APPTYPE CONSOLE}
const
  MYVALUE1: Double = 4.5;
  MYVALUE2: Double = 5;
  MyCalc: Double = 0.9;
var
  MyRoundValue1: Integer;
  MyRoundValue2: Integer;
  d: Double;
begin
  MyRoundValue1 := Round(MYVALUE1);
  d := MYVALUE2 * MyCalc;
  MyRoundValue2 := Round(d);
  WriteLn(MyRoundValue1);
  WriteLn(MyRoundValue2);
end.

This program produces the same output as your 64 bit program.

Or you can force the x87 unit to round intermediates to 64 bit (double) precision by changing the precision control field of the control word:

{$APPTYPE CONSOLE}
uses
  SysUtils;
const
  MYVALUE1: Double = 4.5;
  MYVALUE2: Double = 5;
  MyCalc: Double = 0.9;
var
  MyRoundValue1: Integer;
  MyRoundValue2: Integer;
begin
  Set8087CW($1232); //  <-- round intermediates to 64 bit
  MyRoundValue1 := Round(MYVALUE1);
  MyRoundValue2 := Round(MYVALUE2 * MyCalc);
  WriteLn(MyRoundValue1);
  WriteLn(MyRoundValue2);
end.
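The control word packs several fields. Bits 8 and 9 hold the precision control, where 2 selects double (53 bit mantissa) and 3 selects extended (64 bit mantissa). The sketch below decodes that field and wraps the change in a save and restore so that the setting does not leak into other code. PrecisionField and SavedCW are hypothetical names used for illustration, and $1332 is assumed here to be the usual Delphi default. Get8087CW and Set8087CW come from the System unit. This is aimed at the 32 bit compiler, where the x87 unit does the arithmetic.

```pascal
{$APPTYPE CONSOLE}
const
  MYVALUE2: Double = 5;
  MyCalc: Double = 0.9;

// Hypothetical helper: extract the precision control field (bits 8..9).
function PrecisionField(CW: Word): Integer;
begin
  Result := (CW shr 8) and 3;
end;

var
  SavedCW: Word;
begin
  WriteLn('$1332 precision field: ', PrecisionField($1332)); // 3 = extended
  WriteLn('$1232 precision field: ', PrecisionField($1232)); // 2 = double

  SavedCW := Get8087CW;   // remember whatever is currently in effect
  Set8087CW($1232);       // switch intermediates to double precision
  try
    WriteLn(Round(MYVALUE2 * MyCalc));
  finally
    Set8087CW(SavedCW);   // restore the previous control word
  end;
end.
```

The try/finally only restores the control word for the calling thread's FPU state. It does not address the thread safety concern about Set8087CW raised in the comments below.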

@LURD I dare. There are many scenarios where you have to. A good example is when dealing with external libraries. Sometimes they don't like it if exceptions are unmasked. I'm looking at you Excel 2013. In my work, getting the 32 bit version to behave close to the 64 bit version is important. Hence `$1232` is how my 32 bit version rolls. — David Heffernan, Jul 14 '15 at 19:59
@LURD Of course, as you all must be tired of me saying, it does not help that the Delphi RTL function Set8087CW is not threadsafe. As I have said so many times, I've told Emba how to sort this out but they won't do it. Perhaps because they are too scared to change. — David Heffernan, Jul 14 '15 at 20:00
Why does the compiler use 80bit intermediate values when it's working with doubles or is that because it passing the result into a method that takes an `Extended`? — Graymatter, Jul 14 '15 at 20:25
@Graymatter After `fmul qword ptr [$0041e610]` the x87 unit has in ST(0) the result of 5*0.9, but rounded to 80 bit precision. Because that's how the x87 unit is configured. The compiler opts to use ST(0) directly rather than store to double, and reload. It's more efficient. — David Heffernan, Jul 14 '15 at 20:40

score 3 · Answer 2 · answered Jul 14 '15 at 19:41

System.Round internally accepts an Extended value. In 32-bit, calculations are made as Extended inside the FPU. In 64-bit, Extended is similar to Double. The internal representation might differ just enough to make the difference.

Uwe Raabe


`Extended` is not *similar to Double* in 64bit, it *IS* a `Double`. `Extended` in 32bit is a native 80bit FPU data type, but in 64bit it is just an alias for `Double`. That is 16 bits of lost precision in 64bit systems. This is [documented](http://docwiki.embarcadero.com/Libraries/XE8/en/System.Extended): "On Win32 systems, the size of System.Extended is 10 bytes. On Win64 systems, however, the **System.Extended** type is an alias for System.Double, which is only 8 bytes. This difference can adversely affect numeric precision in floating-point operations." – Remy Lebeau Jul 14 '15 at 20:02
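A quick way to confirm which case applies to a given build is to print the type sizes. This is a small sketch and does nothing beyond reporting SizeOf for the two types.

```pascal
{$APPTYPE CONSOLE}
begin
  // Per the documentation quoted above: 10 bytes on Win32, 8 bytes on Win64.
  WriteLn('SizeOf(Extended) = ', SizeOf(Extended));
  WriteLn('SizeOf(Double) = ', SizeOf(Double));
end.
```

If both lines print 8, Extended is an alias for Double on that target and no 80 bit intermediates are available.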
