How does the int to float cast work for large numbers?

Question

If we cast an integer to a float it needs to be rounded or truncated when it gets too large to be represented exactly by a floating-point number. Here is a small test program to take a look at this rounding.

#include <stdio.h>

#define INT2FLOAT(num) printf(" %d: %.0f\n", (num), (float)(num));

int main(void)
{
    INT2FLOAT((1<<24) + 1);
    INT2FLOAT((1<<24) + 2);
    INT2FLOAT((1<<24) + 3);
    INT2FLOAT((1<<24) + 4);
    INT2FLOAT((1<<24) + 5);
    INT2FLOAT((1<<24) + 6);
    INT2FLOAT((1<<24) + 7);
    INT2FLOAT((1<<24) + 8);
    INT2FLOAT((1<<24) + 9);
    INT2FLOAT((1<<24) + 10);

    return 0;
}

The output is:

 16777217: 16777216
 16777218: 16777218
 16777219: 16777220
 16777220: 16777220
 16777221: 16777220
 16777222: 16777222
 16777223: 16777224
 16777224: 16777224
 16777225: 16777224
 16777226: 16777226

Values exactly halfway between two representable floats are sometimes rounded up and sometimes rounded down. It looks like some form of round-to-even is applied. How does this work exactly? Where can I find the code that does this conversion?

Take a moment to examine [what a float actually is](https://en.wikipedia.org/wiki/IEEE_754). This kind of behaviour is [well defined](https://en.wikipedia.org/wiki/IEEE_754#Rounding_rules). — tadman, Jan 07 '20 at 20:11
The code varies; there are libraries used by the compiler that may not be the same ones used at runtime (I have seen this on Linux and Windows). IEEE 754 has round up, round down and round to zero if I remember right, but I think it is up to the author to choose one as a default. At the end of the day, though, if it conforms to IEEE 754 or some other spec (not everyone uses that format), then ultimately the conversion is governed by the spec. — old_timer, Jan 07 '20 at 20:14
It is pretty easy to do the conversion yourself, be it by hand or by writing a small program, to see what is really going on. For the integer, the binary point is to the right of bit 0. To convert to an IEEE 754 float you move that point, keeping track of how far, until the number has the form 1.fraction, so the point sits just to the right of the msbit. The number of positions the point moved is then encoded as the exponent, and the mantissa is chopped off if needed, with rounding. — old_timer, Jan 07 '20 at 20:17
I think what I missed is this part "Round to nearest, ties to even – rounds to the nearest value; if the number falls midway, it is rounded to the nearest value with an even least significant digit". Specifically "with an even least significant digit". Can someone point me to the code in the gcc compiler source code that handles this rounding? — hko, Jan 07 '20 at 20:22
@M.M True, not obligated to by any means, but in most cases that will be the situation. — tadman, Jan 07 '20 at 21:12
@Patricia Thanks for that information. Do you know where I can find more information on this? I assume this is handled the same for most common Intel PC CPUs. — hko, Jan 07 '20 at 21:34
@Patricia My question was more out of interest in how the rounding works. I could find a lot of results of "round half even" and I understand now how it works. I'm still interested in how this is actually implemented on a CPU or in software if the CPU doesn't support it. I couldn't find anything related to CPU implementation or a C implementation. — hko, Jan 08 '20 at 00:20
See [How to perform round to even with floating point numbers](https://stackoverflow.com/q/8981913/1798593) — Patricia Shanahan, Jan 08 '20 at 00:30
Thanks, your link eventually brought me to this [link](https://lost-contact.mit.edu/afs/cs.wisc.edu/sunx86_57/test_image/u/a/n/andrew/public/cs354/beyond354/arith.flpt.html) Which explains the rounding in good detail. — hko, Jan 08 '20 at 00:54

M.M · Accepted Answer · 2020-01-07T20:33:00.300

5

The behaviour of this conversion is implementation-defined (C11 6.3.1.4/2):

If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner.

This means your compiler should document how it works, but you may not be able to control it.

There are various functions and macros for controlling the rounding direction when rounding a floating-point source to an integer, but I'm not aware of any for the case of converting an integer to a floating-point value.

edited Jan 07 '20 at 20:33

answered Jan 07 '20 at 20:17

M.M


Do you know where I could find the code that does this conversion in the gcc source code? – hko Jan 07 '20 at 20:44
@hko No idea, but chances are it is handled by your hardware rather than occurring explicitly in gcc – M.M Jan 07 '20 at 20:51
@hko It's handled in hardware rather than in GCC. – S.S. Anne Jan 07 '20 at 20:57
@M.M and JL2210 do you know where I can find more information on this? I assume it is handled the same way on most Intel PC CPUs. – hko Jan 07 '20 at 21:43
Research `fesetround(void))` to control rounding mode. – chux - Reinstate Monica Jan 07 '20 at 22:59
@chux-ReinstateMonica `fesetround` takes an argument, and it doesn't affect conversion of int to float – M.M Jan 07 '20 at 23:49
Yes, typo on wrong argument. "it doesn't affect conversion of int to float" --> interesting - something that I would certainly expect. I suppose it might be allowed under "implementation-defined manner". – chux - Reinstate Monica Jan 08 '20 at 00:23

score 2 · Answer 2 · edited Jun 20 '20 at 09:12

In addition to what has been said in other answers: Intel's x87 floating-point unit, for example, works internally with a full 80-bit extended format, which has more significand bits than an int has value bits. So when it rounds the number to the nearest float with a 23-bit stored significand (as I assume from your output), it can be very precise and take all the bits of the int into account.

IEEE 754 specifies a 32-bit float with 23 bits dedicated to storing the significand. For a normalized number the most significant bit is implicit (not stored, as it is always 1), so you actually have 24 bits of significand of the form 1xxxxxxx_xxxxxxxx_xxxxxxxx. This means that 2^24-1 (11111111_11111111_11111111) is the last number that uses all 24 bits at a step of one, and every integer up to 2^24 can be represented exactly. Beyond that, you can represent the even numbers but not the odd ones, as you lack the least significant bit needed for them. So you are able to represent:

                                                     v binary point
16777210  == 2^24-6        11111111_11111111_11111010.
16777211  == 2^24-5        11111111_11111111_11111011.
16777212  == 2^24-4        11111111_11111111_11111100.
16777213  == 2^24-3        11111111_11111111_11111101.
16777214  == 2^24-2        11111111_11111111_11111110.
16777215  == 2^24-1        11111111_11111111_11111111.
16777216  == 2^24         10000000_00000000_00000000_. <-- from here the step becomes 2, as the significand has only 24 bits
16777217  == 2^24+1       10000000_00000000_00000000_. (not representable: the dropped bit would have to be 1)
16777218  == 2^24+2       10000000_00000000_00000001_.
...
33554430  == 2^25-2       11111111_11111111_11111111_.
33554432  == 2^25        10000000_00000000_00000000__. <-- from here the step becomes 4, as another bit is shifted out
33554436  == 2^25+4      10000000_00000000_00000001__.
...

If you imagine the problem in base 10, assume we have floating-point numbers with just 3 decimal digits of significand and a power-of-ten exponent. Counting up from 1, we get this:

  1  => 1.00E0
...
  8  => 8.00E0
  9  => 9.00E0
 10  => 1.00E1  <<< see what happened here: the same digits as the first entry, but with the exponent of ten incremented, i.e. every digit shifted one place to the left.
 11  => 1.10E1
...
 98  => 9.80E1
 99  => 9.90E1
100  => 1.00E2  <<< and here.
101  => 1.01E2
...
996  => 9.96E2
997  => 9.97E2
998  => 9.98E2
999  => 9.99E2
1000 => 1.00E3  <<< exact, but from here on there is no fourth digit left to represent the units.
1001 => 1.00E3  (this number cannot be represented exactly)
...
1004 => 1.00E3  (this number cannot be represented exactly)
1005 => 1.01E3  (this number cannot be represented exactly) <<< an exact tie between 1.00E3 and 1.01E3; rounding is applied here, and the implementation chooses the direction (ties-to-even would give 1.00E3).
...
1009 => 1.01E3  (this number cannot be represented exactly)
1010 => 1.01E3  <<< this is the next number that can be represented exactly with three significant digits. So we switched from steps of one to steps of ten.
...

Note

The behaviour you show is round-to-nearest, ties-to-even, the default IEEE 754 rounding mode and one of the rounding modes Intel processors support. It rounds to the nearest representable value; when the value is exactly halfway between two candidates, it picks the one whose least significant significand bit is 0 (the "even" one), so ties go up about as often as they go down. This avoids the systematic bias of always rounding ties up, which matters in banking, for example (although banks tend to avoid binary floating point altogether because they need precise control over the rounding).

I think the question is how the algorithm works - not the result of the algorithm. — Björn Lindqvist, Jul 20 '23 at 01:12
@BjörnLindqvist, there's no algorithm... there's a cut of bits, because the significand of a `float` has only 24 significant bits, while an `int` has 32. If you just keep the 23 bits that follow the MSB (the MSB itself is not stored), you will see that once the number is large enough for bits to start being shifted out, you observe this behavior (but it's not an algorithm) — Luis Colorado, Jul 21 '23 at 05:38

score 1 · Answer 3 · answered Jul 20 '23 at 11:47

As the others have stated, the algorithm running on your machine is almost certainly implemented in hardware, so there is no C or assembly code that you can inspect. That said, the algorithm can also be implemented in software. Here is how it works for positive integers and 32-bit IEEE 754 floats:

Find the index of the most significant set bit (msb) of the integer and mask that bit out.
Check if msb > 23. If it isn't, the integer can be represented exactly and rounding isn't necessary; the significand is the masked integer shifted left by 23 - msb bits.
Otherwise, divide the masked integer by 2^(msb - 23) into quotient (q) and remainder (r).
Round up (increment q) if 2^(msb - 23) - r < r, or if 2^(msb - 23) - r = r and q % 2 == 1 (round ties to even). Otherwise round down (do nothing).
If q = 2^23, increment msb and set q = 0.
The significand is q and the biased exponent is msb + 127.
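As a worked example of the steps above, here they are applied to n = 16777219 from the question's output, using the same symbols msb, q and r:

```latex
\begin{aligned}
n &= 16777219 = 2^{24} + 3, \qquad \mathrm{msb} = 24 \\
n - 2^{\mathrm{msb}} &= 3, \qquad 2^{\mathrm{msb} - 23} = 2 \\
q &= \lfloor 3 / 2 \rfloor = 1, \qquad r = 3 \bmod 2 = 1 \\
2^{\mathrm{msb} - 23} - r &= 1 = r \ \text{(a tie)}, \qquad q \bmod 2 = 1 \ \Rightarrow\ q \leftarrow 2 \\
\text{significand} &= 2, \qquad \text{exponent} = 24 + 127 = 151 \\
\text{value} &= \left(1 + \tfrac{2}{2^{23}}\right) \cdot 2^{24} = 2^{24} + 4 = 16777220
\end{aligned}
```

This matches the question's output line for 16777219. For n = 16777221 the same steps give q = 2 and r = 1, again a tie, but q is already even, so it stays 2 and the result is also 16777220.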

The following C code implements the algorithm using bit twiddling instead of division to make it more efficient. Its inputs are the unsigned integer u32 and the index of its most significant set bit, msb; its outputs are the significand sig and the exponent exp:

// Mask out the msb (1u avoids signed overflow when msb == 31).
u32 -= (1u << msb);

uint32_t sig;
if (msb > 23) {
    // Index of the truncated part's MSB.
    int8_t trunc_msb = msb - 23;
    sig = u32 >> trunc_msb;

    // Upper bound of truncation range.
    uint32_t upper = 1 << trunc_msb;

    // Truncated value
    uint32_t trunc = u32 & (upper - 1);

    // Distance to the upper and lower bound (which is zero).
    uint32_t lo = trunc - 0;
    uint32_t hi = upper - trunc;

    // Round up if closer to upper bound than lower, or if
    // equally close round up if odd (so to even).
    if ((lo > hi) ||
        (lo == hi && (sig & 1))) {
        sig++;

        // Incrementing the sig may cause wrap-around in
        // which case we increase the msb.
        sig &= (1 << 23) - 1;
        msb += !sig;
    }
} else {
    sig = u32 << (23 - msb);
}
uint8_t exp = msb + 127;
