What is the fastest way to convert float to int on x86

Question

What is the fastest way you know to convert a floating-point number to an int on an x86 CPU? Preferably in C or assembly (that can be inlined in C), for any combination of the following:

32/64/80-bit float -> 32/64-bit integer

I'm looking for a technique that is faster than just letting the compiler do it.

Switch from a Pentium 5 to a chip that does math right... (Man that makes me feel old...) — JBB, Sep 17 '08 at 00:21
I'm rolling around on the ground. Dang -- it's too bad people down-modded you for that! — Kevin, Sep 17 '08 at 17:29
:) Is there actually a Pentium 5? And if there is, sorry, but it does have SSE3 and is therefore perfectly all right when used wisely (see the SSE3 and FISTTP comments). – akauppi, Mar 15 '09 at 11:36

score 18 · Accepted Answer · answered Sep 17 '08 at 00:34

It depends on whether you want a truncating conversion or a rounding one, and at what precision. By default, C performs a truncating conversion when you go from float to int. There are FPU instructions that do the conversion, but they do not give the ANSI C result, and there are significant caveats to using them (such as needing to know the FPU rounding state). Since the answer to your problem is quite complex and depends on some variables you haven't expressed, I recommend this article on the issue:

http://www.stereopsis.com/FPU.html
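
To make the truncating/rounding distinction concrete, here is a minimal sketch using SSE intrinsics. It assumes an x86 target with SSE enabled and the default MXCSR rounding mode; the function names are made up for illustration and do not come from the linked article.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Truncating conversion (CVTTSS2SI): same result as the C cast (int)x. */
int to_int_trunc(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}

/* Rounding conversion (CVTSS2SI): follows the current MXCSR rounding mode,
   which is round-to-nearest-even unless the program has changed it. */
int to_int_round(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}
```

With the default rounding mode, to_int_trunc(2.7f) gives 2 while to_int_round(2.7f) gives 3, and to_int_round(2.5f) gives 2 because ties go to the even integer.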

score 14 · Answer 2 · edited Jul 23 '14 at 17:45

Packed conversion using SSE is by far the fastest method, since you can convert multiple values in the same instruction. ffmpeg has a lot of assembly for this (mostly for converting the decoded output of audio to integer samples); check it for some examples.
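
A rough sketch of what a packed conversion can look like with SSE2 intrinsics follows. This is not ffmpeg's code; the function name is hypothetical, the conversion truncates like a C cast, and unaligned loads/stores are used so the caller does not need aligned buffers.

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86 target with SSE2 */

/* Convert n floats to ints, four per instruction, truncating toward zero. */
void floats_to_ints(int *dst, const float *src, size_t n)
{
    size_t i = 0;

    /* Main loop: CVTTPS2DQ converts four packed floats at once. */
    for (; i + 4 <= n; i += 4) {
        __m128  v = _mm_loadu_ps(src + i);
        __m128i r = _mm_cvttps_epi32(v);
        _mm_storeu_si128((__m128i *)(dst + i), r);
    }

    /* Tail: the remaining 0 to 3 elements use the plain C cast. */
    for (; i < n; ++i)
        dst[i] = (int)src[i];
}
```

Swapping _mm_cvttps_epi32 for _mm_cvtps_epi32 would give a rounding conversion in the vector loop (the scalar tail would then need to round as well).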

answered Sep 17 '08 at 00:24 by Dark Shikari · edited Jul 23 '14 at 17:45 by Matthieu

It is a good suggestion; however, I will caveat it by saying it assumes two things: (1) that you have an x86 processor with SSE (>PII) or SSE2 (>PIII), and (2) that you do in fact want a truncating conversion, not a rounding one. – Zach Burlingame Sep 17 '08 at 00:40
Also note the limitation that this will of course not be an option for an 80-bit floating point value – PhiS Jul 20 '13 at 17:53

score 8 · Answer 3 · edited Apr 30 '13 at 16:10

A commonly used trick for plain x86/x87 code is to force the mantissa part of the float to represent the int. The 32-bit version is shown at the end of this answer.

The 64-bit version is analogous. The Lua version posted in another answer is faster, but it relies on the truncation of a double to a 32-bit result; it therefore requires the x87 unit to be set to double precision, and it cannot be adapted for double to 64-bit int conversion.

The nice thing about this code is that it is completely portable to all platforms conforming to IEEE 754; the only assumption made is that the floating-point rounding mode is set to nearest. Note: portable in the sense that it compiles and works. Platforms other than x86 usually do not benefit much from this technique, if at all.

static const float Snapper=3<<22;

union UFloatInt {
 int i;
 float f;
};

/** by Vlad Kaipetsky
portable assuming FP24 set to nearest rounding mode
efficient on x86 platform
*/
inline int toInt( float fval )
{
  Assert( fabs(fval)<=0x003fffff ); // only 23 bit values handled
  UFloatInt &fi = *(UFloatInt *)&fval;
  fi.f += Snapper;
  return ( (fi.i)&0x007fffff ) - 0x00400000;
}
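
For reference, here is a sketch of the same trick in plain C that reads the bits through memcpy instead of a pointer cast, which avoids the strict-aliasing question. The name to_int_magic is hypothetical. Like the original, it assumes IEEE 754 single precision and round-to-nearest, and it rounds to nearest rather than truncating.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round a float to the nearest int using the "snapper" constant.
   Assumes IEEE 754 binary32 and the round-to-nearest mode. */
int to_int_magic(float fval)
{
    /* Same range limit as the original: |fval| <= 0x003fffff. */
    assert(fval >= -4194303.0f && fval <= 4194303.0f);

    /* Adding 3 << 22 (1.5 * 2^23) moves the value into [2^23, 2^24),
       where one unit in the last place equals exactly 1. */
    float biased = fval + 12582912.0f;

    uint32_t bits;
    memcpy(&bits, &biased, sizeof bits);

    /* The mantissa now holds fval + 2^22; remove that offset. */
    return (int)(bits & 0x007fffffu) - 0x00400000;
}
```

For example, to_int_magic(2.7f) yields 3 and to_int_magic(-2.7f) yields -3, so the result differs from a C cast whenever the fractional part matters.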

For unsigned integers it can be simpler: inline uint32_t toInt( float fval ) { static float const snapper = 1<<23; fval += snapper; return (*(uint32_t*)&fval) & 0x007FFFFF; } – chmike, May 01 '09 at 07:42
`static float const snapper;` makes this slower than necessary. Simply write `fval += 1<<23;` — R.. GitHub STOP HELPING ICE, Nov 25 '10 at 03:05
On x86 it is not slower, as the code generated is the same. There are no FPU instructions taking immediate arguments on x87. — Suma, Nov 25 '10 at 13:56

score 7 · Answer 4 · answered Mar 15 '09 at 11:34

If you can guarantee that the CPU running your code is SSE3 compatible (even Pentium 5 is, JBB), you can allow the compiler to use the FISTTP instruction (i.e. -msse3 for gcc). It seems to do the conversion the way it should always have been done:

http://software.intel.com/en-us/articles/how-to-implement-the-fisttp-streaming-simd-extensions-3-instruction/

Note that FISTTP is different from FISTP (that has its problems, causing the slowness). It comes as part of SSE3 but is actually (the only) X87-side refinement.

CPUs other than x86 would probably do the conversion just fine, anyway. :)

Processors with SSE3 support
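
If you would rather issue the instruction explicitly than rely on the compiler flag, something along these lines may work with GCC or Clang extended inline assembly (AT&T syntax). This is only a sketch: the function name is hypothetical, and it assumes an x86 CPU with SSE3.

```c
/* Truncating float -> int conversion via FISTTP.
   GCC/Clang extended asm, AT&T syntax; assumes x86 with SSE3. */
int convert_fisttp(float x)
{
    int n;
    __asm__ __volatile__ (
        "fisttpl %0"   /* store st(0) as a truncated 32-bit int, then pop */
        : "=m" (n)     /* output: memory operand receiving the integer */
        : "t" (x)      /* input: the compiler loads x into st(0) */
        : "st");       /* the instruction pops st(0), so mark it clobbered */
    return n;
}
```

Unlike FISTP, the result does not depend on the rounding bits of the x87 control word, so no control-word save and restore is needed.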

score 7 · Answer 5 · answered Sep 17 '08 at 00:27

There is one instruction to convert a floating point to an int in assembly: use the FISTP instruction. It pops the value off the floating-point stack, converts it to an integer, and then stores it at the address specified. I don't think there would be a faster way (unless you use extended instruction sets like MMX or SSE, which I am not familiar with).

Another instruction, FIST, leaves the value on the FP stack but I'm not sure it works with quad-word sized destinations.

score 7 · Answer 6 · edited Dec 15 '14 at 08:59

The Lua code base has the following snippet to do this (check in src/luaconf.h from www.lua.org). If you find (SO finds) a faster way, I'm sure they'd be thrilled.

Oh, lua_Number means double. :)

/*
@@ lua_number2int is a macro to convert lua_Number to int.
@@ lua_number2integer is a macro to convert lua_Number to lua_Integer.
** CHANGE them if you know a faster way to convert a lua_Number to
** int (with any rounding method and without throwing errors) in your
** system. In Pentium machines, a naive typecast from double to int
** in C is extremely slow, so any alternative is worth trying.
*/

/* On a Pentium, resort to a trick */
#if defined(LUA_NUMBER_DOUBLE) && !defined(LUA_ANSI) && !defined(__SSE2__) && \
    (defined(__i386) || defined (_M_IX86) || defined(__i386__))

/* On a Microsoft compiler, use assembler */
#if defined(_MSC_VER)

#define lua_number2int(i,d)   __asm fld d   __asm fistp i
#define lua_number2integer(i,n)     lua_number2int(i, n)

/* the next trick should work on any Pentium, but sometimes clashes
   with a DirectX idiosyncrasy */
#else

union luai_Cast { double l_d; long l_l; };
#define lua_number2int(i,d) \
  { volatile union luai_Cast u; u.l_d = (d) + 6755399441055744.0; (i) = u.l_l; }
#define lua_number2integer(i,n)     lua_number2int(i, n)

#endif

/* this option always works, but may be slow */
#else
#define lua_number2int(i,d) ((i)=(int)(d))
#define lua_number2integer(i,d) ((i)=(lua_Integer)(d))

#endif
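
To show what the magic constant in the union trick does, here is a standalone sketch that is not part of Lua. The function name is hypothetical. It assumes IEEE 754 doubles, round-to-nearest, and that the addition is actually performed in double precision (true for SSE2 code; on x87 it needs the precision control set to double, as noted in another answer).

```c
#include <stdint.h>
#include <string.h>

/* Round a double to the nearest 32-bit int with the 1.5 * 2^52 constant. */
int32_t double_to_int32_magic(double d)
{
    /* 6755399441055744.0 == 1.5 * 2^52. After the addition one unit in
       the last place equals 1, so the low mantissa bits hold the rounded
       integer in two's complement form. */
    double biased = d + 6755399441055744.0;

    uint64_t bits;
    memcpy(&bits, &biased, sizeof bits);

    /* Keep the low 32 bits. Converting an out-of-range uint32_t to int32_t
       is implementation-defined; x86 compilers wrap as expected. */
    return (int32_t)(uint32_t)bits;
}
```

For instance, double_to_int32_magic(-2.7) returns -3, so this rounds to nearest rather than truncating like the (int) cast in the fallback branch.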

the swine · Answer 7 · 2014-02-26T17:37:48.493

I assume truncation is required, same as if one writes i = (int)f in "C".

If you have SSE3, you can use:

int convert(float x)
{
    int n;
    __asm {
        fld x
        fisttp n // the extra 't' means truncate
    }
    return n;
}

Alternatively, with SSE2 (or on x64, where inline assembly might not be available), you can use the following, which is almost as fast:

#include <xmmintrin.h>
int convert(float x)
{
    return _mm_cvtt_ss2si(_mm_load_ss(&x)); // extra 't' means truncate
}

On older computers there is the option of setting the rounding mode manually and performing the conversion with the ordinary fistp instruction. That is probably only practical for arrays of floats; otherwise, care must be taken not to use any constructs that would make the compiler change the rounding mode (such as casting). It is done like this:

void Set_Trunc()
{
    // cw is a 16-bit register [_ _ _ ic rc1 rc0 pc1 pc0 iem _ pm um om zm dm im]
    __asm {
        push ax // use stack to store the control word
        fnstcw word ptr [esp]
        fwait // needed to make sure the control word is there
        mov ax, word ptr [esp] // or pop ax ...
        or ax, 0xc00 // set both rc bits (alternately "or ah, 0xc")
        mov word ptr [esp], ax // ... and push ax
        fldcw word ptr [esp]
        pop ax
    }
}

void convertArray(int *dest, const float *src, int n)
{
    Set_Trunc();
    __asm {
        mov eax, src
        mov edx, dest
        mov ecx, n // load loop variables

        cmp ecx, 0
        je bottom // handle zero-length arrays

    top:
        fld dword ptr [eax]
        fistp dword ptr [edx]
        add eax, 4 // advance to the next source float
        add edx, 4 // advance to the next destination int
        loop top // decrement ecx, jump to top
    bottom:
    }
}

Note that the inline assembly only works with Microsoft's Visual Studio compilers (and maybe Borland); it would have to be rewritten in GNU assembly syntax in order to compile with gcc. The SSE2 solution with intrinsics should be quite portable, however.

Other rounding modes are possible by different SSE2 intrinsics or by manually setting the FPU control word to a different rounding mode.
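
As a sketch of the intrinsics route, the MXCSR rounding mode can be switched around a non-truncating conversion and restored afterwards. The function name is hypothetical. The volatile temporaries are an assumption-driven precaution: compilers generally do not model the rounding mode as an input of the conversion, so without them the conversion could be moved or merged across the mode changes.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Convert x using the given MXCSR rounding mode, one of _MM_ROUND_NEAREST,
   _MM_ROUND_DOWN, _MM_ROUND_UP or _MM_ROUND_TOWARD_ZERO. */
int to_int_with_mode(float x, unsigned int mode)
{
    volatile float in = x;   /* read after the mode switch below */
    volatile int out;        /* written before the mode is restored */

    unsigned int saved = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(mode);

    out = _mm_cvtss_si32(_mm_set_ss(in));  /* CVTSS2SI obeys MXCSR */

    _MM_SET_ROUNDING_MODE(saved);          /* leave MXCSR as we found it */
    return out;
}
```

Switching MXCSR for every single value is unlikely to be fast; as with the x87 control word, it makes more sense to set the mode once around a whole array.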

re inline assembly: yes Embarcadero (formerly Borland) does support it (both C++ and Delphi compilers do) — PhiS, Feb 26 '14 at 17:56

score 3 · Answer 8 · edited Jul 20 '13 at 07:25

Since MS screws us out of inline assembly in x64 and forces us to use intrinsics, I looked up which one to use. The MSDN doc gives _mm_cvtsd_si64x with an example.

The example works, but is horribly inefficient: it uses an unaligned load of 2 doubles where we need just a single load, which also gets rid of the additional alignment requirement. A lot of needless loads and reloads are also produced, but they can be eliminated as follows:

 #include <intrin.h>
 #pragma intrinsic(_mm_cvtsd_si64x)
 long long _inline double2int(const double &d)
 {
     return _mm_cvtsd_si64x(*(__m128d*)&d);
 }

Result:

        i=double2int(d);
000000013F651085  cvtsd2si    rax,mmword ptr [rsp+38h]  
000000013F65108C  mov         qword ptr [rsp+28h],rax
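
Note that cvtsd2si rounds according to the current MXCSR rounding mode. If C-style truncation is wanted instead, the truncating counterpart of the intrinsic can be used. A minimal sketch, assuming a 64-bit x86 target (the function name is hypothetical):

```c
#include <emmintrin.h>  /* SSE2 intrinsics; the 64-bit forms need x86-64 */

/* Truncating double -> 64-bit integer conversion (CVTTSD2SI),
   matching the C cast (long long)d for in-range values. */
long long double_to_int64_trunc(double d)
{
    return _mm_cvttsd_si64(_mm_set_sd(d));
}
```

This also covers values outside the 32-bit range, e.g. 4e18 converts to 4000000000000000000.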

The rounding mode can be set without inline assembly, e.g.

    _control87(_RC_NEAR,_MCW_RC);

where rounding to nearest is default (anyway).

The question whether to set the rounding mode at each call or to assume it will be restored (third party libs) will have to be answered by experience, I guess. You will have to include float.h for _control87() and related constants.

And, no, this will not work in 32 bits, so keep using the FISTP instruction:

_asm fld d
_asm fistp i

This is interesting, and appears to be correct, but in my tests the x64 compiler actually generates the *exact same code* (verified using a disassembler) for your code here and the MSDN example. — Cody Gray - on strike, Jul 20 '13 at 07:26

score 3 · Answer 9 · Don Neufeld · 2008-09-17T00:47:32.630

If you really care about the speed of this, make sure your compiler is generating the FIST instruction. In MSVC you can do this with /QIfist; see this MSDN overview.

You can also consider using SSE intrinsics to do the work for you, see this article from Intel: http://softwarecommunity.intel.com/articles/eng/2076.htm

answered Sep 17 '08 at 00:29 · edited Sep 17 '08 at 00:47

score -10 · Answer 10 · answered Sep 17 '08 at 00:35

Generally, you can trust the compiler to be efficient and correct. There is usually nothing to be gained by rolling your own functions for something that already exists in the compiler.

answered Sep 17 '08 at 00:35 by user14504

You are simply incorrect. In this case rolling your own is a very demonstrable 10x speed improvement over the built in functions because when you do it yourself you can trust the state of the FPU flags, which the built in _ftol does not do, or you can do it parallelized using SSE. – Don Neufeld Sep 17 '08 at 00:49
Or you can pass the '-msse3' flag (gcc) and have the 'fixed' FISTTP do it right, seamlessly. – akauppi Mar 15 '09 at 11:28
The compiler-supplied routines are not well suited for multimedia applications where performance is crucial – Nick Dowell Apr 07 '11 at 14:37
