Floating-point-to-integer conversion rounding up instead of truncating

Question

I was surprised to find that a floating-point-to-integer conversion rounded up instead of truncating the fractional part. Here is some sample code, compiled using Clang, that reproduces that behavior:

double a = 1.12;  // 1.1200000000000001 * 2^0
double b = 1024LL * 1024 * 1024 * 1024 * 1024;  // 1 * 2^50
double c = a * b;  // 1.1200000000000001 * 2^50
long long d = c;  // 1261007895663739

Using exact math, the floating-point value represents

1.1200000000000001 * 2^50 = 1261007895663738.9925899906842624

I was expecting the resulting integer to be 1261007895663738 due to truncation but it is actually 1261007895663739. Why?

*Using exact math* - but C does not use exact math. https://ideone.com/kG3SD4 — Eugene Sh., Mar 05 '21 at 22:05
Neither `1.12` nor `1.1200000000000001` are exact `double` values. — chtz, Mar 05 '21 at 22:45
Assuming `double` maps to IEEE-754 `binary64`: `a=1.1200000000000001e+00 b=1.1258999068426240e+15 c=1.2610078956637390e+15 trunc(c)=1.2610078956637390e+15 floor(c)=1.2610078956637390e+15 ceil(c)=1.2610078956637390e+15 rint(c)=1.2610078956637390e+15 round(c)=1.2610078956637390e+15 d=1261007895663739`. In other words, `c` is an integer value and no information is lost in the conversion to `d`. — njuffa, Mar 05 '21 at 23:11
`double a = 1.12; // 1.1200000000000001 * 2^0` this is wrong. The base is 10, not 2: `1.1200000000000001 * 10^0` — phuclv, Mar 06 '21 at 02:07

score 3 · Accepted Answer · answered Mar 06 '21 at 23:11

3

Assuming IEEE 754 double precision, 1.12 is exactly

1.12000000000000010658141036401502788066864013671875

Written in binary, its significand is exactly:

1.0001111010111000010100011110101110000101000111101100

Note the last two zeros are intentional, since it's what you get with double precision (1 bit before fraction separator, plus 52 fractional bits).

So, if you shift by 50 places, you'll get an integer value

100011110101110000101000111101011100001010001111011.00

or in decimal

1261007895663739

when converting to long long, no truncation/rounding occurs, the conversion is exact.

answered Mar 06 '21 at 23:11

aka.nice

9,100
1
28
40

Interesting. I also noticed that if I change `double` to `long double`, the floating point number is different (more precise). Can you expand on that in your answer? Thanks! – fumoboy007 Mar 08 '21 at 03:06
1

Assuming 80-bits extented precision of x86, that is 64 bits significand, 1.12 exact value is 1.120000000000000000004336808689942017736029811203479766845703125, thus slightly smaller than the double precision... Shifted by 2^50, you get exactly '1261007895663738.8800048828125'. This one shall be truncated to 1261007895663738 when converting to long long... – aka.nice Mar 08 '21 at 12:34
Note that I'm using Squeak Smalltalk to easily obtain those values. I did not mention it in the answer, because not directly related to the C tag. But since it's same underlying floating point (as far as C adhere to IEEE 754, which it does not always), any language with convenient utilities might do the work. Smalltalk has LargeInteger (bigNum), exact Fraction, and I extended it with ArbitraryPrecisionFloat, which makes a good candidate. – aka.nice Mar 08 '21 at 12:38
Here is an example of snippet that I use `((ArbitraryPrecisionFloat readFrom: '1.12' readStream numBits: 64) timesTwoPower: 50) asFraction printShowingMaxDecimalPlaces: 100` or `1.12 significandAsInteger printStringBase: 2` etc... – aka.nice Mar 08 '21 at 12:39
I think I just don’t understand floating-point well enough. Why are the Float80 and Double numbers so different? Shouldn’t the Double number be the same as the Float80 number but with less precision? – fumoboy007 Mar 08 '21 at 20:24
1

But they are almost the same, except the last bits. When a value is not representable in base 2 (like 4/3 is represented by infinite serie of bits 1+1/2^2+1/2^4+1/2^6+1/2^8+..... that is 1.01010101010101010101010...), the infinite serie of bits has to be rounded to the available precision, that is 52 bits after the leading 1 of significand for double precision and 63 bits for long double (at least on x86). 1.12 is 112/100 = 28/25. 1/25 is not reprsentable in base 2, 1/5 being the infinite bit serie 0.00110011001100110011..... – aka.nice Mar 08 '21 at 20:45
So when you init `long double x = 1.12`, what the compiler does is initilaize the long double with 112/100 rounded to long double precision. It's not the same as writing `long double x = (double) 1.12;` in which case the long double will have same bits as the double - additional 11 trailing bits will be zero. – aka.nice Mar 08 '21 at 20:51
BTW, *so different* is a bit more than 1.0e-16, that's really tiny! Consider that the last bit of 1.12 represent a quantum of 2^-52, that's about 2.2e-16, so the representation error of the double is less than 1/2 ulp as due, that's all what you can get with double precision. – aka.nice Mar 08 '21 at 21:02

chux - Reinstate Monica · Answer 2 · 2021-03-07T20:36:05.183

Using exact math, the floating-point value represents ...

a is not exactly 1.12 as 0.12 is not dyadic.

// `a` not exactly 1.12 
double a = 1.12;  // 1.1200000000000001 * 2^0

Nearby double values:

1.11999999999999988...  Next closest double
1.12                    Code
1.12000000000000011...  Closest double
1.12000000000000033...

Instead, let us look closer to truer values.

#include <stdio.h>
#include <float.h>

int main() {
  double a = 1.12;  // 1.1200000000000001 * 2^0
  double b = 1024LL * 1024 * 1024 * 1024 * 1024;  // 1 * 2^50
  int prec = DBL_DECIMAL_DIG;
  printf("a %.*e\n", prec, a);
  printf("b %.*e\n", prec, b);

  double c = a * b;
  double whole;
  printf("c %.*e (r:%g)\n", prec, c, modf(c, &whole));
  long long d = (long long) c;
  printf("d %lld\n", d);
}

Output

a 1.12000000000000011e+00
b 1.12589990684262400e+15
c 1.26100789566373900e+15 (r:0)
d 1261007895663739

Floating-point-to-integer conversion rounding up instead of truncating

2 Answers2