How can I convert an integer to float with rounding towards zero?

Question

When an integer is converted to floating-point, and the value cannot be directly represented by the destination type, the nearest value is usually selected (required by IEEE-754).

I would like to convert an integer to floating-point with rounding towards zero in case the integer value cannot be directly represented by the floating-point type.

Example:

int i = 2147483647;
float nearest = static_cast<float>(i);  // 2147483648 (likely)
float towards_zero = convert(i);        // 2147483520
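
A minimal sketch (not part of the original question) for checking which integers a `float` can hold exactly, assuming `float` is IEEE-754 binary32 with a 24-bit significand. The helper name `is_exact_in_float` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: true if v survives an int32 -> float -> integer
// round trip unchanged. Assumes float is IEEE-754 binary32.
bool is_exact_in_float(std::int32_t v) {
    const float f = static_cast<float>(v);
    // f lies in [-2^31, 2^31], so converting it to a 64-bit integer
    // cannot overflow, unlike converting it back to a 32-bit int.
    return static_cast<std::int64_t>(f) == static_cast<std::int64_t>(v);
}
```

Under that assumption, 16777217 and 2147483647 fail the check while 16777216 and 2147483520 pass, which is why 2147483520 is the expected round-toward-zero result for the example above.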

An integer (even 64 bits) can always be represented as a floating point. An integer has no fractional part so there is no rounding issue. — Paul Ogilvie, Jul 31 '20 at 13:06
@PaulOgilvie That's not true. The mantissa of floating-point numbers is not large enough to hold all possible integer values. — interjay, Jul 31 '20 at 13:08
@PaulOgilvie How do you represent 64 bits of data in 32 bits? You can't. There will be a lossy conversion — NathanOliver, Jul 31 '20 at 13:08
Why is `2147483520` the required "rounded towards zero" result? — Paul Ogilvie, Jul 31 '20 at 13:15
@PaulOgilvie because float can't represent the integers between 2147483520 and 2147483648. — jjj, Jul 31 '20 at 13:16
@PaulOgilvie: In IEEE-754 binary32, each finite number is represented as ±d.ddd…ddd•2^e, where d is a binary digit, there are 23 ds after the “.”, and −126≤e≤127. This equals ±dddd…ddd.•2^(e−23), where we have moved the “.” and adjusted the exponent. When a number represented in this format is 2^24 or greater, e must be 24 or greater. In this case, ±dddd…ddd is an integer and 2^(e−23) is even, so odd integers cannot be represented. As the number exceeds 2^25, 2^26, and so on, 2^(e−23) becomes a multiple of 4, then 8, and so on, further reducing which integers in the range can be represented. — Eric Postpischil, Jul 31 '20 at 13:31
@PaulOgilvie: So, no, not all integers can be represented in a floating-point format. Then consider some number like 16,777,217. It cannot be represented. The two nearest representable values are 16,777,216 and 16,777,218. “Rounding towards zero” means exactly that: When rounding 16,777,217 to produce a value in the binary32 format, we will pick the choice toward zero, so we pick 16,777,216. — Eric Postpischil, Jul 31 '20 at 13:33
@EricPostpischil, thank you very much for this explanation and example. — Paul Ogilvie, Jul 31 '20 at 13:44
[Related](https://stackoverflow.com/questions/52582831/loss-of-precision-for-int-to-float-conversion/52583193#52583193). — Eric Postpischil, Jul 31 '20 at 14:31
Please refrain from tagging as C *and* C++. This yields an answer for each language (plus some that claim to work for both, often incorrectly so) which are both correct or "acceptable". — ljrk, Aug 01 '20 at 09:28
@larkey I doubt this question has an answer independent of a computer language, where I count all instances of assembler code to implement IEEE-754 as a separate language. — Albert van der Horst, Aug 01 '20 at 18:15
@AlbertvanderHorst Yes, that's why it should be tagged either C or C++, but not both. The answer is language dependent and tagging it as both doesn't make sense. — ljrk, Aug 01 '20 at 20:18
I doubt that tagging is the answer. The question should state whether a solution is wanted in general, for C only, for C++ only, for both C and C++, or for either one of C or C++. Tagging is for searching I presume, and cannot change the meaning of a question. Am I wrong? — Albert van der Horst, Aug 03 '20 at 12:27

Eric Towers · Answer 1 · 2020-08-02T19:04:07.943

27

Since C++11, one can use fesetround(), the floating-point environment rounding direction manager. There are four standard rounding directions and an implementation is permitted to add additional rounding directions.

#include <cfenv> // for fesetround() and FE_* macros
#include <iostream> // for cout and endl
#include <iomanip> // for setprecision()

#pragma STDC FENV_ACCESS ON

int main(){
    int i = 2147483647;

    std::cout << std::setprecision(10);

    std::fesetround(FE_DOWNWARD);
    std::cout << "round down         " << i << " :  " << static_cast<float>(i) << std::endl;
    std::cout << "round down        " << -i << " : " << static_cast<float>(-i) << std::endl;

    std::fesetround(FE_TONEAREST);
    std::cout << "round to nearest   " << i << " :  " << static_cast<float>(i) << std::endl;
    std::cout << "round to nearest  " << -i << " : " << static_cast<float>(-i) << std::endl;

    std::fesetround(FE_TOWARDZERO);
    std::cout << "round toward zero  " << i << " :  " << static_cast<float>(i) << std::endl;
    std::cout << "round toward zero " << -i << " : " << static_cast<float>(-i) << std::endl;

    std::fesetround(FE_UPWARD);
    std::cout << "round up           " << i << " :  " << static_cast<float>(i) << std::endl;
    std::cout << "round up          " << -i << " : " << static_cast<float>(-i) << std::endl;

    return(0);
}

Compiled under g++ 7.5.0, the resulting executable outputs

round down         2147483647 :  2147483520
round down        -2147483647 : -2147483648
round to nearest   2147483647 :  2147483648
round to nearest  -2147483647 : -2147483648
round toward zero  2147483647 :  2147483520
round toward zero -2147483647 : -2147483520
round up           2147483647 :  2147483648
round up          -2147483647 : -2147483520

Omitting the #pragma doesn't seem to change anything under g++.
@chux comments correctly that the standard doesn't explicitly state that fesetround() affects rounding in static_cast<float>(i). For a guarantee that the set rounding direction affects the conversion, use std::nearbyint and its -f and -l variants. See also std::rint and its many type-specific variants.
I probably should have looked up the format specifier to use a space for positive integers and floats, rather than stuffing it into the preceding string constants.

(I haven't tested the following snippet.) Your convert() function would be something like

float convert(int i, int direction = FE_TOWARDZERO){
    float retVal = 0.;
    int prevdirection = std::fegetround();
    std::fesetround(direction);
    retVal = static_cast<float>(i);
    std::fesetround(prevdirection);
    return(retVal);
}
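
A hedged sketch of the same idea with the save/restore wrapped in a scope guard. The class `RoundingModeGuard` and the function `convert_guarded` are hypothetical names, not part of the answer. The `volatile` locals are the workaround mentioned in the comments below: they are meant to keep an optimizing compiler from moving the conversion outside the two `fesetround()` calls. Whether the conversion honors the rounding mode at all remains implementation-dependent.

```cpp
#include <cfenv>

// Hypothetical RAII helper: sets a rounding direction on construction
// and restores the previous one on destruction.
class RoundingModeGuard {
public:
    explicit RoundingModeGuard(int mode) : saved_(std::fegetround()) {
        ok_ = (std::fesetround(mode) == 0);  // fesetround returns 0 on success
    }
    ~RoundingModeGuard() { std::fesetround(saved_); }
    RoundingModeGuard(const RoundingModeGuard&) = delete;
    RoundingModeGuard& operator=(const RoundingModeGuard&) = delete;
    bool ok() const { return ok_; }

private:
    int saved_;
    bool ok_;
};

float convert_guarded(int i, int direction = FE_TOWARDZERO) {
    // volatile forces the load, conversion, and store to stay between
    // the guard's constructor and destructor calls.
    volatile int in = i;
    volatile float out;
    {
        RoundingModeGuard guard(direction);
        out = static_cast<float>(in);
    }
    return out;
}
```

The guard restores the previous direction even if the enclosed code exits early, which the hand-written save/restore pair does not.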

edited Aug 02 '20 at 19:04

answered Jul 31 '20 at 22:50


2

I also agree this is the overall [best approach](https://stackoverflow.com/a/63196297/2410359), yet even with `std::fesetround(FE_TOWARDZERO)` I do not see C++ as **specifying** that `static_cast<float>(i)` will perform as desired, yet it is entirely reasonable that it should do so. – chux - Reinstate Monica Aug 01 '20 at 01:37
3

Be aware that the GCC implementation has bugs and can ignore the rounding mode changes sometimes: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678 – jpa Aug 01 '20 at 07:53
2

For a facility added in C++11, it would have been nice to specify whether the "environment" meant the entire process or only the current thread. Switching the rounding mode of the entire process is the best way to have weird results in neighbor threads... – Matthieu M. Aug 01 '20 at 15:11
1

@MatthieuM. Unfortunate oversight in the standard, but on normal real-world implementations (on machines with hardware FPUs at least), the FP rounding / exception settings are per-thread. (Typically a special CPU register, like x86-64 MXCSR. This register is part of the context / architectural state of each thread that context switches save/restore.) IDK if there are any soft-FP implementations where it's global. (I assume *you* know that, I'm commenting for future readers that this is normally safe in practice.) – Peter Cordes Aug 01 '20 at 16:10
@PeterCordes: I actually only knew for x64; I've never used other architectures so my knowledge is patchy to non-existent when it comes to them. Thanks for confirming that it should behave sanely on typical hardware :) – Matthieu M. Aug 01 '20 at 17:03
@MatthieuM.: I'm not super familiar with many other ISAs, but from a computer-architecture POV it's the only sane design. A control register shared by all cores would couple out-of-order speculative exec between cores, and be basically unusable for lots of cases (like different cores running different processes). And would need some special interconnect support for multi-socket systems to make changes on one socket affect others. Just to make sure, I checked and PowerPC has a FPSCR described as a "user-level register"; AArch64 / ARM32 have a similar FPCR / FPSCR register. – Peter Cordes Aug 01 '20 at 17:31
2

@MatthieuM. `For a facility added in C++11, it would have been nice to specify whether the "environment" meant the entire process or only the current thread.` I didn't check C++11, but at least the current draft specifies it clearly: `[cfenv.syn] The floating-point environment has thread storage duration.` – eerorika Aug 01 '20 at 18:19
@chux-ReinstateMonica : I agree, and am a little surprised to learn, that `static_cast<float>()` does not claim influence by the current floating point rounding direction. I've added a bullet about the explicitly influenced `nearbyint()` and `rint()` families of conversion functions. – Eric Towers Aug 02 '20 at 19:00
Re: "guarantee that the set rounding direction affects the conversion", those functions lack a `float nearbyintf( IntegralType arg )` as needed by OP. – chux - Reinstate Monica Aug 14 '20 at 12:20
@chux-ReinstateMonica : OP writes "I would like to convert an integer to floating-point" and provides an example using `float`. This does not mean OP requires a `float` return, only that the example does. If `float` is essential, adapt https://stackoverflow.com/questions/15294046/round-a-double-to-the-closest-and-greater-float using `std::nextafter(from, to)`, with, in the `FE_TONEAREST` case, conversion back to `double` and comparison of differences to select among the two potential `float`s. – Eric Towers Aug 14 '20 at 14:21
EricTowers That approach works most of the time yet makes incorrect results due to [double rounding](https://en.wikipedia.org/wiki/Rounding#Double_rounding) from time to time. Best solutions only round once, all other math needs to be exact. – chux - Reinstate Monica Aug 14 '20 at 17:53
The `convert()` function usually does not work in optimized code (as neither GCC nor Clang support `#pragma STDC FENV_ACCESS ON`). The optimized code has statements reordered as `std::fesetround(direction); std::fesetround(prevdirection); return static_cast<float>(i);`. Making some variables `volatile` is a lame workaround. https://godbolt.org/z/qK1jdf – Paweł Bylica Aug 28 '20 at 08:01

jjj · Answer 2 · 2020-07-31T15:01:03.530

11

You can use std::nextafter.

int i = 2147483647;
float nearest = static_cast<float>(i);  // 2147483648 (likely)
float towards_zero = std::nextafter(nearest, 0.f);        // 2147483520

However, you first have to check whether static_cast<float>(i) is exact (or has already been rounded toward zero). If so, nextafter would move the result one step further toward 0, which you probably don't want.

Your convert function might look like this:

float convert(int x){
    if(std::abs(long(static_cast<float>(x))) <= std::abs(long(x)))
        return static_cast<float>(x);
    return std::nextafter(static_cast<float>(x), 0.f);
}

It may be that sizeof(int) == sizeof(long), or even sizeof(int) == sizeof(long long). In that case, long(...) has undefined behavior when static_cast<float>(x) exceeds the range of long. Depending on the compiler, it might still work in these cases.
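
A sketch of a variant that avoids both the `long` width concern and `std::abs`, assuming `long long` is strictly wider than `int` (checked at compile time). The name `convert_rz` is hypothetical. It compares the round-tripped value with the original by sign instead of by absolute value.

```cpp
#include <cmath>

// Hypothetical variant of convert(): round int -> float toward zero.
float convert_rz(int x) {
    static_assert(sizeof(long long) > sizeof(int),
                  "sketch assumes long long is wider than int");
    const float f = static_cast<float>(x);
    // f is an integer no larger in magnitude than 2^(bits of int - 1),
    // so it fits in long long under the assumption above.
    const long long back = static_cast<long long>(f);
    const long long orig = x;
    // The conversion moved away from zero if it grew a non-negative
    // value or shrank a negative one.
    const bool away = (orig >= 0) ? (back > orig) : (back < orig);
    return away ? std::nextafter(f, 0.0f) : f;
}
```

Because no absolute value is taken, the most negative `int` needs no special handling in this sketch.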

edited Jul 31 '20 at 15:01

answered Jul 31 '20 at 13:15


5

The problem is how to detect when `nextafter` is needed. The check `int(static_cast<float>(x)) == x` may result in undefined behavior. Example: `2147483647` to `float` is `2147483648.0f` and back to `int` is undefined behavior as `2147483648` cannot be represented by the `int` type. See: https://en.cppreference.com/w/cpp/language/implicit_conversion#Floating.E2.80.93integral_conversions. – Paweł Bylica Jul 31 '20 at 13:40
4

`int(static_cast<float>(x))` is not defined if `int` is 32-bit and `static_cast<float>(x)` produces 2147483648. C++ 2017 (draft N4659) 7.10 [conv.fpint] 1 says “The behavior is undefined if the truncated value cannot be represented in the destination type.” C has similar wording. – Eric Postpischil Jul 31 '20 at 13:41
`g++ returns max or min int` Could you quote / link to the documentation that specifies that? – eerorika Jul 31 '20 at 13:47
I changed `int(static_cast<float>(x))` to `long(static_cast<float>(x))`, so there is no problem anymore, and this is faster than doing a check. – jjj Jul 31 '20 at 14:00
1

`static_cast<float>(x)` could already round down (e.g., for `x=(1<<24)+1`), so you should check for `<=` instead of `==` (in case `x` is positive) – chtz Jul 31 '20 at 14:01
7

How does `long` help? `sizeof(int)` can be equal to `sizeof(long)`. – Evg Jul 31 '20 at 14:02
@chtz: That would serve for round down (round toward negative infinity), but the request is for round toward zero. – Eric Postpischil Jul 31 '20 at 14:36
@EricPostpischil yes, my comment was only for the case that `x` is positive. Of course, you'll need to check more cases if negative inputs can happen. – chtz Jul 31 '20 at 14:40
2

`convert(INT_MIN)` is a problem due to `std::abs` – chux - Reinstate Monica Jul 31 '20 at 17:02
3

`std::abs(long(x))` can have UB if `sizeof(long) == sizeof(int)` and `x = INT_MIN`. – chqrlie Aug 01 '20 at 07:15
1

@chqrlie That could be fixed by flipping the logic, and writing a custom "anti-abs" function that converts positive to negative. – eerorika Aug 01 '20 at 18:26

chux - Reinstate Monica · Answer 3 · 2021-03-22T08:55:44.203

11

An implementation-dependent C solution that I am confident has a C++ counterpart.

Temporarily change the rounding mode; the conversion uses it to determine which way to go in inexact cases.

"the nearest value is usually selected (required by IEEE-754)" is not entirely accurate. The inexact case is rounding mode dependent.

C does not specify this behavior. C allows this behavior, as it is implementation-defined:

"If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner."

#include <fenv.h>

float convert(int i) {
   #pragma STDC FENV_ACCESS ON
   int save_round = fegetround();
   fesetround(FE_TOWARDZERO);
   float f = (float) i;
   fesetround(save_round);
   return f;
}

edited Mar 22 '21 at 08:55

answered Jul 31 '20 at 17:15


1

Don't forget about `#pragma STDC FENV_ACCESS ON`, otherwise this has undefined behavior. – Ruslan Aug 01 '20 at 19:12
The `convert()` function usually does not work in optimized code (as neither GCC nor Clang support `#pragma STDC FENV_ACCESS ON`). The optimized code has statements reordered as `fesetround(FE_TOWARDZERO); fesetround(save_round); return (float) i;`. Making some variables `volatile` is a lame workaround. https://godbolt.org/z/TaPxqa – Paweł Bylica Aug 28 '20 at 07:55

njuffa · Answer 4 · 2020-08-01T05:12:34.687

I understand the question to be restricted to platforms that use IEEE-754 binary floating-point arithmetic, and where float maps to IEEE-754 (2008) binary32. This answer assumes this to be the case.

As other answers have pointed out, if the tool chain and the platform support this, use the facilities supplied by fenv.h to set the rounding mode for the conversion as desired.

Where those are not available, or slow, it is not difficult to emulate the truncation during int to float conversion. Basically, normalize the integer until the most significant bit is set, recording the required shift count. Now, shift the normalized integer into place to form the mantissa, compute the exponent based on the normalization shift count, and add in the sign bit based on the sign of the original integer. The process of normalization can be sped up significantly if a clz (count leading zeros) primitive is available, maybe as an intrinsic.

The exhaustively tested code below demonstrates this approach for 32-bit integers, see function int32_to_float_rz. I successfully built it as both C and C++ code with the Intel compiler version 13.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <fenv.h>

float int32_to_float_rz (int32_t a)
{
    uint32_t i = (uint32_t)a;
    int shift = 0;
    float r;
    // take absolute value of integer
    if (a < 0) i = 0 - i;
    // normalize integer so MSB is set
    if (!(i > 0x0000ffffU)) { i <<= 16; shift += 16; }
    if (!(i > 0x00ffffffU)) { i <<=  8; shift +=  8; }
    if (!(i > 0x0fffffffU)) { i <<=  4; shift +=  4; }
    if (!(i > 0x3fffffffU)) { i <<=  2; shift +=  2; }
    if (!(i > 0x7fffffffU)) { i <<=  1; shift +=  1; }
    // form mantissa with explicit integer bit 
    i = i >> 8;
    // add in exponent, taking into account integer bit of mantissa
    if (a != 0) i += (127 + 31 - 1 - shift) << 23;
    // add in sign bit
    if (a < 0) i |= 0x80000000;
    // reinterpret bit pattern as 'float'
    memcpy (&r, &i, sizeof r);
    return r;
}

#pragma STDC FENV_ACCESS ON

float int32_to_float_rz_ref (int32_t a)
{
    float r;
    int orig_mode = fegetround ();
    fesetround (FE_TOWARDZERO); 
    r = (float)a;
    fesetround (orig_mode); 
    return r;
}

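int main (void)
{
    uint32_t bits;   // unsigned counter: wraps to 0 without signed-overflow UB
    int32_t arg;
    float res, ref;

    bits = 0;
    do {
        // reinterpret the counter as a two's-complement int32_t
        memcpy (&arg, &bits, sizeof arg);
        res = int32_to_float_rz (arg);
        ref = int32_to_float_rz_ref (arg);
        if (res != ref) {
            printf ("error @ %08x: res=% 14.6a  ref=% 14.6a\n", (unsigned)bits, res, ref);
            return EXIT_FAILURE;
        }
        bits++;
    } while (bits);
    return EXIT_SUCCESS;
}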
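
The clz-based speedup mentioned in the text can be sketched as follows. This is an illustration rather than the answer's tested code: the names `clz32` and `int32_to_float_rz_clz` are hypothetical, `__builtin_clz` is assumed to be available on GCC and Clang with a 32-bit `unsigned int`, and `float` is assumed to be IEEE-754 binary32 with the same endianness as integers.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: count leading zeros of a nonzero 32-bit value.
static int clz32(std::uint32_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_clz(v);  // assumes unsigned int is 32 bits wide
#else
    int n = 0;                // portable fallback loop
    while (!(v & 0x80000000u)) { v <<= 1; ++n; }
    return n;
#endif
}

float int32_to_float_rz_clz(std::int32_t a) {
    if (a == 0) return 0.0f;
    std::uint32_t mag = static_cast<std::uint32_t>(a);
    if (a < 0) mag = 0u - mag;          // absolute value, safe for INT32_MIN
    const int shift = clz32(mag);
    mag <<= shift;                      // normalize so the MSB is set
    std::uint32_t bits = mag >> 8;      // 24-bit significand; low bits truncated
    // The explicit integer bit adds 1 to the exponent field, hence the -1.
    bits += static_cast<std::uint32_t>(127 + 31 - 1 - shift) << 23;
    if (a < 0) bits |= 0x80000000u;     // sign bit
    float r;
    std::memcpy(&r, &bits, sizeof r);   // reinterpret bit pattern as float
    return r;
}
```

The early return for zero replaces the `a != 0` test in the exponent step, since a count-leading-zeros primitive is often undefined for a zero argument.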

`i >> 8`, `i += (127 + 31 - 1 - shift) << 23` assume `float` characteristics. Not unreasonable assumptions, yet not specified by C. `memcpy (&r, &i, sizeof r);` also relies on reasonable, yet unspecified assumptions about `float` size matching `int32_t` and common integers and FP endian. — chux - Reinstate Monica, Aug 01 '20 at 05:02
I am suspicious about the manual conversion as I'd expect code to mask off the MSbit of the significand in the formation of the `float`. `i = i >> 8;` looks insufficient. Perhaps `i = (i&0x7FFFFFFF) >> 8;`? — chux - Reinstate Monica, Aug 01 '20 at 05:06
I'll add language about the assumption that `float` maps to IEEE-754 `binary32`. By my reading that assumption is in the question. The integer bit of the mantissa doesn't need to be masked during combining since the exponent LSB is decremented instead (see comment in code). — njuffa, Aug 01 '20 at 05:10
1) Interesting approach to masking out the implied bit. 2) Even if `float` is IEEE-754 binary32, the endian issue remains - although I find it increasingly rare for FP and integer endian to differ. 3) IEEE uses _significand_ not _mantissa_ (subtle differences). 4) UV for a well tested answer. — chux - Reinstate Monica, Aug 01 '20 at 05:26
Perhaps a good idea to factor out the leading-zero handling to a separate function to make it easy to use GNU C `__builtin_clz` or other intrinsic as a drop-in replacement. Or just show `shift = _lzcnt_u32(i); i <<= shift;` as a comment. Oh, you did mention that in the text. I guess you need sign-extension so you can't just do `i <<= shift - 8;` — Peter Cordes, Aug 01 '20 at 15:55

score 6 · Answer 5 · answered Aug 01 '20 at 15:41

A specified approach.

"the nearest value is usually selected (required by IEEE-754)" implies OP expects IEEE-754 is involved. Many C/C++ implementations do follow much of IEEE-754, yet adherence to that spec is not required. The following relies on C specifications.

Conversion of an integer type to a floating point type is specified as below. Notice conversion is not specified to depend on rounding mode.

When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. C17dr § 6.3.1.4 2

When the result is not exact, is the converted value the nearest higher or the nearest lower?
A round trip int --> float --> int is warranted.

Round tripping needs to watch out for convert(near_INT_MAX) converting to outside the int range.

Rather than rely on long or long long having a wider range than int (C does not specify this property), let code compare on the negative side as INT_MIN (with 2's complement) can be expected to convert exactly to a float.

#include <math.h>  // nextafterf

float convert(int i) {
  int n = (i < 0) ? i : -i;  // n <= 0
  float f = (float) n;
  int rt_n = (int) f;  // Overflow not expected on the negative side
  // If f rounded away from 0.0 ...
  if (rt_n < n) {
    f = nextafterf(f, 0.0f);  // Move toward 0.0
  }
  return (i < 0) ? f : -f;
}

score 6 · Answer 6 · answered Aug 01 '20 at 16:38

Changing the rounding mode is somewhat expensive, although I think some modern x86 CPUs do rename MXCSR so it doesn't have to drain the out-of-order execution back-end.

If you care about performance, benchmarking njuffa's pure integer version (using shift = __builtin_clz(i); i<<=shift;) against the rounding-mode-changing version would make sense. (Make sure to test in the context you want to use it in; it's so small that it matters how well it overlaps with surrounding code.)

AVX-512 can use rounding-mode overrides on a per-instruction basis, letting you use a custom rounding mode for the conversion basically the same cost as a normal int->float. (Only available on Intel Skylake-server, and Ice Lake CPUs so far, unfortunately.)

#include <immintrin.h>

float int_to_float_trunc_avx512f(int a) {
  const __m128 zero = _mm_setzero_ps();      // SSE scalar int->float are badly designed to merge into another vector, instead of zero-extend.  Short-sighted Pentium-3 decision never changed for AVX or AVX512
  __m128 v = _mm_cvt_roundsi32_ss (zero, a, _MM_FROUND_TO_ZERO |_MM_FROUND_NO_EXC);
  return _mm_cvtss_f32(v);               // the low element of a vector already is a scalar float so this is free.
}

_mm_cvt_roundi32_ss is a synonym, IDK why Intel defined both i and si names, or if some compilers might only have one.

This compiles efficiently with all 4 mainstream x86 compilers (GCC/clang/MSVC/ICC) on the Godbolt compiler explorer.

# gcc10.2 -O3 -march=skylake-avx512
int_to_float_trunc_avx512f:
        vxorps  xmm0, xmm0, xmm0
        vcvtsi2ss       xmm0, xmm0, {rz-sae}, edi
        ret

int_to_float_plain:
        vxorps  xmm0, xmm0, xmm0             # GCC is always cautious about false dependencies, spending an extra instruction to break it, like we did with setzero()
        vcvtsi2ss       xmm0, xmm0, edi
        ret

In a loop, the same zeroed register can be reused as a merge target, allowing the vxorps zeroing to be hoisted out of a loop.

Using _mm_undefined_ps() instead of _mm_setzero_ps(), we can get ICC to skip zeroing XMM0 before converting into it, like clang does for plain (float)i in this case. But ironically, clang which is normally cavalier and reckless about false dependencies compiles _mm_undefined_ps() the same as setzero in this case.

The performance in practice of vcvtsi2ss (scalar integer to scalar single-precision float) is the same whether you use a rounding-mode override or not (2 uops on Ice Lake, same latency: https://uops.info/). The AVX-512 EVEX encoding is 2 bytes longer than the AVX1.

Rounding mode overrides also suppress FP exceptions (like "inexact"), so you couldn't check the FP environment to later detect if the conversion happened to be exact (no rounding). But in this case, converting back to int and comparing would be fine. (You can do that without risk of overflow because of the rounding towards 0).
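
The exactness check described above can be sketched in plain C++ without intrinsics. The function name `conversion_was_exact` is hypothetical, and the sketch assumes `truncated` really was produced by a round-toward-zero conversion of `a`.

```cpp
// Hypothetical helper: given an int and the float obtained from it by a
// round-toward-zero conversion, report whether the conversion was exact.
bool conversion_was_exact(int a, float truncated) {
    // Rounding toward zero never increases the magnitude, so
    // |truncated| <= |a| and the cast back to int cannot overflow.
    return static_cast<int>(truncated) == a;
}
```

Passing a float produced by round-to-nearest instead would break the no-overflow argument, because such a value can exceed the range of `int`.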

score 4 · Answer 7 · answered Aug 01 '20 at 18:22

1. Shift the integer right by an arithmetic shift until the number of bits agrees with the precision of the floating point arithmetic. Count the shifts.
2. Convert the integer to float. The result is now precise.
3. Multiply the resulting float by a power of two corresponding to the number of shifts.
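
The steps above can be sketched as follows, with one deliberate change that is an assumption of this sketch rather than part of the answer: the shift is applied to the magnitude instead of the signed value, because an arithmetic right shift of a negative integer rounds toward negative infinity rather than toward zero. The name `convert_by_shifting` is hypothetical.

```cpp
#include <cmath>
#include <limits>

// Hypothetical sketch: truncate by shifting, convert, then scale back.
float convert_by_shifting(int x) {
    constexpr int kDigits = std::numeric_limits<float>::digits;  // 24 for binary32
    static_assert(kDigits < std::numeric_limits<unsigned>::digits,
                  "sketch assumes float has fewer significand bits than unsigned");
    // Work on the magnitude so that discarded bits move the value toward zero.
    unsigned mag = (x < 0) ? 0u - static_cast<unsigned>(x)
                           : static_cast<unsigned>(x);
    int shifts = 0;
    while (mag >> kDigits) {  // more significant bits than the float can hold
        mag >>= 1;
        ++shifts;
    }
    // mag now fits in the significand, so this conversion is exact, and
    // scaling by a power of two is exact as well.
    const float f = std::ldexp(static_cast<float>(mag), shifts);
    return (x < 0) ? -f : f;
}
```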

eerorika · Answer 8 · 2020-07-31T14:48:21.180

3

A simple solution is to use a higher precision floating point for comparison. As long as the high precision floating point can exactly represent all integers, we can accurately compare whether the float result was greater.

double should be sufficient with 32 bit integers, and long double is sufficient for 64 bit on most systems, but it's good practice to verify it.

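#include <climits>  // CHAR_BIT
#include <cmath>    // std::abs, std::nextafter
#include <limits>   // std::numeric_limits

float convert(int x) {
    static_assert(std::numeric_limits<double>::digits
                  >= sizeof(int) * CHAR_BIT);
    float  f = x;
    double d = x;
    return std::abs(f) > std::abs(d)
        ? std::nextafter(f, 0.f)
        : f;
}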

edited Jul 31 '20 at 14:48

answered Jul 31 '20 at 14:41


Re *"As long as the high precision floating point can exactly represent all integers"*: Isn't that [impossible with IEEE-754](https://www.youtube.com/watch?v=MBWAP_8zxaM&t=34m24s)? – Peter Mortensen Aug 01 '20 at 00:55
2

@PeterMortensen Why would it be impossible? – eerorika Aug 01 '20 at 01:20
1

@eerorika: your solution does not work if `sizeof(int) == sizeof(double)`. ie on an architecture with 64-bit `int`, `long`, `float` and `double`. Using `long double` does not help since these can be 64-bit as well. – chqrlie Aug 01 '20 at 07:17
1

@chqrlie Of course not. It works if mantissa is at least the size of the int. The static assert will tell you if the exotic target system is incompatible. – eerorika Aug 01 '20 at 11:16
Unfortunately `std::nextafter` is not as fast as it could be on most implementations, especially for this use case. You just need an integer decrement of the FP bit-pattern to decrease the magnitude, but a non-inlined `nextafterf` will have to compare its 2 args and check for special cases. Hmm, maybe I should expand my x86 intrinsics answer to include a SSE2 version that manually inlines the nextafter. – Peter Cordes Aug 01 '20 at 18:09
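
The integer decrement of the bit pattern mentioned in the last comment could look roughly like this. It is a sketch under the assumptions that `float` is IEEE-754 binary32 and that the argument is finite and nonzero. The name `step_toward_zero` is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical replacement for std::nextafter(f, 0.f) in this use case.
// Precondition: f is finite and nonzero.
float step_toward_zero(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    // Exponent and significand form a magnitude that is ordered like an
    // unsigned integer; since it is nonzero, the decrement never borrows
    // from the sign bit.
    --bits;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

In the conversion discussed here the precondition holds whenever a step is needed, since a result that was rounded away from zero is never zero.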

dbush · Answer 9 · 2020-07-31T15:05:47.230

For nonnegative values, this can be done by taking the integer value and shifting right until the highest set bit is less than 24 bits (i.e. the precision of IEEE single) from the right, then shifting back.

For negative values, you would shift right until all bits from 24 and up are set, then shift back. For the shift back, you'll first need to cast the value to unsigned to avoid undefined behavior of left-shifting a negative value, then cast the result back to int before converting to float.

Note also that the conversion from unsigned to signed is implementation-defined; however, we're already dealing with implementation-defined behavior, as we're assuming float is IEEE 754 and int is two's complement.

float round_to_zero(int x)
{
    int cnt = 0;
    if (x >= 0) {
        while (x != (x & 0xffffff)) {
            x >>= 1;
            cnt++;
        }
        return x << cnt;
    } else {
        while (~0xffffff != (x & ~0xffffff)) {
            x >>= 1;
            cnt++;
        }
        return (int)((unsigned)x << cnt);
    }
}
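
One caveat about the negative branch above: a right shift of a negative value discards low bits by flooring, so negative inputs move away from zero (for example, -2147483647 becomes -2147483648). A hedged sketch that keeps the shift-and-shift-back idea but applies it to the magnitude is shown below. The name `round_to_zero_magnitude` is hypothetical, and `float` is assumed to have a 24-bit significand.

```cpp
// Hypothetical variant: clear the low bits of the magnitude, then convert.
float round_to_zero_magnitude(int x) {
    // Unsigned negation is well defined even for the most negative int.
    unsigned mag = (x < 0) ? 0u - static_cast<unsigned>(x)
                           : static_cast<unsigned>(x);
    int cnt = 0;
    while (mag != (mag & 0xffffffu)) {  // more than 24 significant bits
        mag >>= 1;
        ++cnt;
    }
    mag <<= cnt;  // low cnt bits are now zero, so mag is exactly representable
    const float f = static_cast<float>(mag);  // exact conversion
    return (x < 0) ? -f : f;
}
```

Working on the magnitude also removes the need for the unsigned-to-signed cast in the original negative branch.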

`x != (x & 0xffffff)` is `1<<24 <= x`, and similarly for the negative case. — Eric Postpischil, Jul 31 '20 at 15:03
