
Conversion from float to int with rounding happens fairly often in C++ code that works with floating point data. One use, for example, is in generating conversion tables.

Consider this snippet of code:

// Convert a positive float value and round to the nearest integer
int RoundedIntValue = (int) (FloatValue + 0.5f);

The C and C++ languages define the (int) cast as truncating toward zero, so 0.5f must be added first to get rounding to the nearest integer (this works only when the input is positive). For the above, VS2015's compiler generates the following code:

movss   xmm9, DWORD PTR __real@3f000000 // 0.5f
addss   xmm0, xmm9
cvttss2si   eax, xmm0

The above works, but could be more efficient...

Intel's designers apparently thought this problem important enough to solve with a single instruction that does just what's needed, converting to the nearest integer value under the default rounding mode: cvtss2si (note, just one 't' in the mnemonic).

If cvtss2si replaced the cvttss2si instruction in the above sequence, two of the three instructions would be eliminated (as would the use of an extra xmm register, which could result in better optimization overall).

So how can we code C++ statement(s) to get this simple job done with the one cvtss2si instruction?

I've been poking around, trying things like the following but even with the optimizer on task it doesn't boil down to the one machine instruction that could/should do the job:

int RoundedIntValue = _mm_cvt_ss2si(_mm_set_ss(FloatValue));

Unfortunately, the compiler seems bent on zeroing a whole vector register whose upper elements will never be used, instead of just using the one 32-bit value:

movaps  xmm1, xmm0
xorps   xmm2, xmm2
movss   xmm2, xmm1
cvtss2si eax, xmm2

Perhaps I'm missing an obvious approach here.

Can you suggest C++ statements that will ultimately compile to the single cvtss2si instruction?

NoelC
  • Note that rounding mode is controlled separately, so `cvtss2si` will not necessarily round up to the nearest integer. – Jester Jan 04 '17 at 02:11
  • What do you mean by "nearest positive value"? If `FloatValue` is `-199.8`, adding 0.5 is not going to get you a positive value. – Kerrek SB Jan 04 '17 at 02:12
  • Have you considered using `lroundf()` from `math.h`? – fuz Jan 04 '17 at 02:17
  • Also `lrintf` which, when using `gcc -fno-math-errno` will generate a single `cvtss2si` instruction. Not so for `clang` or `icc`. – Jester Jan 04 '17 at 02:21
  • Assume the input values are positive in the example that uses + 0.5f and truncates. However, a more general solution that works with negative numbers, just as cvtss2si would do, will be preferable. Thanks for the thought to use gcc, but in this particular case I need Visual Studio to compile that cvtss2si instruction. Unfortunately, lroundf() just generates a call lroundf instruction, and I have the project set up to optimize intrinsics as far as I can tell. – NoelC Jan 04 '17 at 02:25
  • Actually your `_mm_cvt_ss2si` intrinsic version does work for both `gcc` and `clang`. Looks like `icc` and visual studio are too stupid for this ;) – Jester Jan 04 '17 at 02:28
  • Just to confirm, in my environment the default rounding mode is in force, so the cvtss2si instruction would do just what's needed. – NoelC Jan 04 '17 at 02:31
  • Note that `(int)(floatvalue+0.5)` is not just inefficient, it's *incorrect*. There are numerous `float` or `double` values for which this does not correctly round. – EOF Jan 04 '17 at 11:12
  • Uh, no, not that I can see. There are none I know of within the ranges of relatively small positive values we're interested in where the logic of adding a half and truncating is incorrect. Negative numbers have already been covered. Are you thinking of something else? – NoelC Jan 04 '17 at 18:40
  • @NoelC I didn't see your comment until now, since you didn't `@EOF` it. Anyway, have a look at this simple program: `int main(void) { double a = nextafter(0.5, 0); printf("correctly rounded:\t%e\n", round(a)); printf("incorrectly rounded:\t%e\n", trunc(a+0.5)); }`, see if you can still claim that adding 0.5 causes correct rounding for "small positive values". – EOF Aug 25 '17 at 18:22
  • @EOF, thanks for the pedantic example, though if you go one more iteration of nextafter the numbers come out right, so your example is just flirting with the precision of the least significant bit. We call that close enough, since we expect to lose precision at the LSB with every math operation. It's an issue in the same realm as the potential for precision loss by subtracting a very small number from 1.0. I'm here complaining about one extra instruction - we're certainly not willing to waste the CPU time to call a library function to ensure that 0.49999997f is rounded down instead of up! – NoelC Aug 26 '17 at 23:19
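
The disagreement in the comments above can be checked directly. The following is a minimal sketch, not code from the thread: the function names are hypothetical, and the `volatile` temporary is an assumption added so that the sum is rounded to `float` precision even on builds that evaluate expressions in a wider format. It compares add-half-then-truncate with `std::lround` for the largest `float` below 0.5.

```cpp
#include <cassert>
#include <cmath>

// Round a non-negative float by adding one half and truncating,
// as in the question. (Hypothetical helper name.)
int RoundByAddingHalf(float x)
{
    assert(x >= 0.0f); // the trick is only meant for non-negative inputs
    // volatile forces the sum to be stored as a float before the cast,
    // so extended-precision evaluation cannot hide the effect.
    volatile float sum = x + 0.5f;
    return static_cast<int>(sum); // the cast truncates toward zero
}

// Reference result: the library's round-to-nearest (ties away from zero).
int RoundWithLibrary(float x)
{
    return static_cast<int>(std::lround(x));
}

// The largest float strictly below 0.5 (about 0.49999997f).
float LargestBelowHalf()
{
    return std::nextafter(0.5f, 0.0f);
}
```

For `LargestBelowHalf()`, the exact sum lies halfway between two representable floats and rounds up to 1.0f, so `RoundByAddingHalf` returns 1 while `RoundWithLibrary` returns 0. Whether that single case matters is the judgment call debated above.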

2 Answers


This is an optimization defect in Microsoft's compiler, and the bug has been reported to Microsoft. As other commentators have mentioned, modern versions of GCC, Clang, and ICC all produce the expected code. For a function like:

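#include <xmmintrin.h>  // SSE intrinsics: _mm_set_ss, _mm_cvt_ss2si

// Converts using the current rounding mode (round-to-nearest-even by default).
int RoundToNearestEven(float value)
{
   return _mm_cvt_ss2si(_mm_set_ss(value));
}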

all compilers but Microsoft's will emit the following object code:

cvtss2si  eax, xmm0
ret

whereas Microsoft's compiler (as of VS 2015 Update 3) emits the following:

movaps    xmm1, xmm0
xorps     xmm2, xmm2
movss     xmm2, xmm1
cvtss2si  eax,  xmm2
ret

The same is seen for the double-precision version, cvtsd2si (i.e., the _mm_cvtsd_si32 intrinsic).
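
To make the rounding behavior of the intrinsic concrete, here is a small sketch. It assumes the default rounding mode is in force (checked with an `assert`), and the `HAVE_SSE_CONVERT` macro and the `std::lrint` fallback for non-SSE targets are additions for portability, not part of the answer. Under those assumptions, ties go to the nearest even integer, which differs from the add-0.5-and-truncate approach.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Use the SSE intrinsic where it is available. HAVE_SSE_CONVERT is a
// hypothetical macro introduced only for this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_CONVERT 1
#endif

int RoundToNearestEven(float value)
{
    // Both paths honor the current rounding mode; this sketch assumes
    // the default (round to nearest, ties to even).
    assert(std::fegetround() == FE_TONEAREST);
#ifdef HAVE_SSE_CONVERT
    return _mm_cvt_ss2si(_mm_set_ss(value)); // should map to cvtss2si
#else
    return static_cast<int>(std::lrint(value)); // portable fallback
#endif
}
```

With the default mode, 0.5f converts to 0, 1.5f and 2.5f both convert to 2, and -1.5f converts to -2. Negative inputs are handled without any special casing.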

Until such time as the optimizer is improved, there is no faster alternative available. Fortunately, the code currently being generated is not as slow as it might seem. Moving and register-clearing are among the fastest possible instructions, and several of these can probably be implemented solely in the front end as register renames. And it is certainly faster than any of the possible alternatives—often by orders of magnitude:

  • The trick of adding 0.5 that you mentioned will not only be slower because it has to load a constant and perform an addition, it will also not produce the correctly rounded result in all cases.

  • Using the _mm_load_ss intrinsic to load the floating-point value into an __m128 structure suitable to be used with the _mm_cvt_ss2si intrinsic is a pessimization because it causes a spill to memory, rather than just a register-to-register move.

    (Note that while _mm_set_ss is always better for x86-64, where the calling convention uses SSE registers to pass floating-point values, I have occasionally observed that _mm_load_ss will produce more optimal code in x86-32 builds than _mm_set_ss, but it is highly dependent upon multiple factors and has only been observed when multiple intrinsics are used in a complicated sequence of code. Your default choice should be _mm_set_ss.)

  • Substituting a reinterpret_cast<__m128&>(value) (or moral equivalent) for the _mm_set_ss intrinsic is both unsafe and inefficient. It results in a spill from the SSE register to memory; the cvtss2si instruction then uses that memory location as its source operand.

  • Declaring a temporary __m128 structure and value-initializing it is safe, but even more inefficient. Space is allocated on the stack for the entire structure, and then each slot is filled with either 0 or the floating-point value. This structure's memory location is then used as the source operand for cvtss2si.

  • The lrint family of functions provided by the C standard library should do what you want, and in fact compile to straightforward cvt* instructions on some other compilers, but are extremely sub-optimal on Microsoft's compiler. They are never inlined, so you always pay the cost of a function call. Plus, the code inside of the function is sub-optimal. Both of these have been reported as bugs, but we are still awaiting a fix. There are similar problems with other conversion functions provided by the standard library, including lround and friends.

  • The x87 FPU offers a FIST/FISTP instruction that performs a similar task, but the C and C++ language standards require that a cast truncate, rather than round-to-nearest-even (the default FPU rounding mode), so the compiler is obligated to insert a bunch of code to change the current rounding mode, perform the conversion, and then change it back. This is extremely slow, and there's no way to instruct the compiler not to do it except by using inline assembly. Beyond the fact that inline assembly is not available with the 64-bit compiler, MSVC's inline assembly syntax also offers no way to specify inputs and outputs, so you pay double load and store penalties both ways. And even if this weren't the case, you'd still have to pay the cost of copying the floating-point value from an SSE register, into memory, and then onto the x87 FPU stack.

Intrinsics are great, and can often allow you to produce code that is faster than what would otherwise be generated by the compiler, but they are not perfect. If you're like me and find yourself frequently analyzing the disassembly for your binaries, you will find yourself frequently disappointed. Nevertheless, your best choice here is to use the intrinsic.

As for why the optimizer emits the code the way it does, I can only speculate, since I don't work on the Microsoft compiler team. My guess is that a number of the other cvt* instructions have false dependencies that the code generator needs to work around. For example, cvtss2sd does not modify the upper 64 bits of the destination XMM register. Such partial register updates cause stalls and reduce the opportunity for instruction-level parallelism. This is especially a problem in loops, where the upper bits of the register form a second loop-carried dependency chain, even though we don't actually care about their contents. Because execution of the cvtss2sd instruction cannot begin until the preceding instruction that writes the destination register has completed, latency is vastly increased. However, executing an xorps instruction on the destination first (or a movss load from memory, which also zeroes the upper elements) clears the register's upper bits, breaking the dependency and avoiding the possibility of a stall. This is an interesting case where shorter code does not equate to faster code. The compiler team started inserting these dependency-breaking instructions for scalar conversions in the compiler shipped with VS 2010, and probably applied the heuristic overzealously.

Cody Gray - on strike
  • The only reason the + 0.5f method was mentioned is that we have a fair number of places (e.g., creation of translation tables) where all the values are positive. Ideally, an inline that just gives an efficient and worry-free conversion of either positive or negative values would be preferable. – NoelC Jan 04 '17 at 16:26
  • By the way, I can't really mark this as "the answer" since it doesn't actually accomplish getting the proper instruction out of Microsoft's compiler. I appreciate you taking your time to write it up, Cody. Maybe Microsoft will see it and be embarrassed into improving the compiler at some point in the future. – NoelC Jan 04 '17 at 18:21
  • My thinking when writing the answer was that "it is impossible" is an answer, even though it might not be the answer you want. But don't feel compelled to accept any answer that you don't like. You can submit a bug report to Microsoft if you like: http://connect.microsoft.com/, but my experience has been pretty much that they ignore anything that isn't a showstopper. – Cody Gray - on strike Jan 04 '17 at 18:22
  • Fair enough. I think I'll leave it a few days and see if any other heretofore unthought of tricks emerge, then I'll mark your answer as a correct assessment of "you just can't get there from here". Meanwhile, from the "assumptions must be changed" angle I'm going to look into whether changing to clang might be feasible. I've heard other good things about it as well. – NoelC Jan 05 '17 at 00:44
  • If you ever did (or do) decide to file a bug report with Microsoft, feel free to leave a link to it here. I'll edit it into my answer, or you can post an answer of your own with that link and accept it instead. I won't have my feelings hurt! Either way, getting the link out there will encourage more people to upvote it, and therefore encourage MS to fix the problem. If you haven't already submitted a bug report and/or don't want to, I might do so myself. @noel – Cody Gray - on strike Jan 10 '17 at 09:49
  • Good idea. The bug report is: https://connect.microsoft.com/VisualStudio/feedback/details/3118553 – NoelC Jan 11 '17 at 12:01
  • By the way, I've found cases where Microsoft's VS 2017 compiler will emit the minimal instruction sequence, but I can't make that happen in our actual production code - which isn't anything special, it just generates tables of numbers. And so we have the + 0.5f method which works fine. – NoelC Aug 26 '17 at 23:23
  • What I noticed was that, although MSVC 2015 will add the extra instructions, when it inlines the code it does what we were expecting and simply calls `cvtss2si` – user1593842 Sep 28 '17 at 16:49
  • It's hit or miss. We have a situation where even with the latest MSVC 2017 it's inlining with the extra instructions shown above. – NoelC Dec 19 '17 at 18:59
  • *why the optimizer emits the code in the way that it does* - looks pretty obvious to me: it doesn't optimize away `_mm_set_ss()` zeroing the top 3 elements. It merges the low dword into a zeroed vector. GCC often has the same missed optimization, although not in this specific case. e.g. [How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?](//stackoverflow.com/q/39318496) The initial `movaps` makes no sense though; it doesn't need another copy of the source reg. – Peter Cordes Jul 11 '19 at 20:51

Visual Studio 2017 version 15.6, released today, appears to finally correct this issue. We now see a single instruction used when inlining this function:

inline int ConvertFloatToRoundedInt(float FloatValue)
{
    return _mm_cvt_ss2si(_mm_set_ss(FloatValue)); // Convert to integer with rounding
}
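
For the table-generation use case mentioned in the question, a wrapper like this might be used as follows. This is only an illustrative sketch: `BuildPercentTable`, the 0..255 to 0..100 mapping, and the `HAVE_SSE_CONVERT` macro with its `std::lrint` fallback are hypothetical and not taken from the production code discussed here.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// HAVE_SSE_CONVERT is a hypothetical macro introduced only for this sketch.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_CONVERT 1
#endif

inline int ConvertFloatToRoundedInt(float FloatValue)
{
#ifdef HAVE_SSE_CONVERT
    return _mm_cvt_ss2si(_mm_set_ss(FloatValue)); // Convert to integer with rounding
#else
    return static_cast<int>(std::lrint(FloatValue)); // fallback for non-SSE targets
#endif
}

// Hypothetical table: map an 8-bit value (0..255) to a percentage (0..100).
std::array<int, 256> BuildPercentTable()
{
    std::array<int, 256> table{};
    const float scale = 100.0f / 255.0f;
    for (std::size_t i = 0; i < table.size(); ++i)
        table[i] = ConvertFloatToRoundedInt(static_cast<float>(i) * scale);
    return table;
}
```

Whether the conversion collapses to a single instruction inside such a loop still depends on the compiler version and on how the call is inlined, so it is worth checking the disassembly.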

I'm impressed that Microsoft finally got a round tuit.

NoelC