Most efficient way to convert vector of uint32 to vector of float?

Question

x86 does not have an SSE instruction to convert from unsigned int32 to floating point. What would be the most efficient instruction sequence for achieving this?

EDIT: To clarify, i want to do the vector sequence of the following scalar operation:

unsigned int x = ...
float res = (float)x;

EDIT2: Here is a naive algorithm for doing a scalar conversion.

unsigned int x = ...
float bias = 0.f;
if (x > 0x7fffffff) {
    bias = (float)0x80000000;
    x -= 0x80000000;
}
res = signed_convert(x) + bias;

Do you mean truncating/rounding/...? Could you give a minimal example of the desired input/output? — Joachim Isaksson, Feb 05 '12 at 18:20
I'm confused, do you want to convert `int` to `float` or `float` to `int` or both? Could you correct the question title and/or body to make it less ambiguous? — Alexey Frunze, Feb 05 '12 at 19:08

Stephen Canon · Accepted Answer · 2012-02-05T20:54:59.673

Your naive scalar algorithm doesn't deliver a correctly-rounded conversion -- it will suffer from double rounding on certain inputs. As an example: if x is 0x88000081, then the correctly-rounded result of conversion to float is 2281701632.0f, but your scalar algorithm will return 2281701376.0f instead.

Off the top of my head, you can do a correct conversion as follows (as I said, this is off the top of my head, so it's likely possible to save an instruction somewhere):

movdqa   xmm1,  xmm0    // make a copy of x
psrld    xmm0,  16      // high 16 bits of x
pand     xmm1, [mask]   // low 16 bits of x
orps     xmm0, [onep39] // float(2^39 + high 16 bits of x)
cvtdq2ps xmm1, xmm1     // float(low 16 bits of x)
subps    xmm0, [onep39] // float(high 16 bits of x)
addps    xmm0,  xmm1    // float(x)

where the constants have the following values:

mask:   0000ffff 0000ffff 0000ffff 0000ffff
onep39: 53000000 53000000 53000000 53000000

What this does is separately convert the high- and low-halves of each lane to floating-point, then add these converted values together. Because each half is only 16 bits wide, the conversion to float does not incur any rounding. Rounding only occurs when the two halves are added; because addition is a correctly-rounded operation, the entire conversion is correctly rounded.

By contrast, your naive implementation first converts the low 31 bits to float, which incurs a rounding, then conditionally adds 2^31 to that result, which may cause a second rounding. Any time you have two separate rounding points in a conversion, unless you are exceedingly careful about how they occur, you should not expect the result to be correctly rounded.

At first glance, I don't understand the math. Why does your recipe give the correct answer? Not that I am saying it is incorrect... — zr., Feb 05 '12 at 20:41
@zr. Each individual step is pretty straightforward, so if you sit down and follow it through it should be fairly self-explanatory. That said, if there's a specific step that you're having trouble with, I'd be happy to clarify why it does what the comments say it does. — Stephen Canon, Feb 05 '12 at 21:02
Would you please explain how the upper 16-bits are handled? I don't understand the OR followed by the SUB — zr., Feb 05 '12 at 21:13
@zr.: If you look at the significand bits of the floating-point number 2^39, bit i has value 2^(16+i). Thus, 2^39 | (x >> 16) is precisely (2^39 + (x & 0xffff0000)) as a floating-point number, without needing to do an explicit conversion. Subtracting 2^39 removes the bias, leaving (float)(x & 0xffff0000). — Stephen Canon, Feb 06 '12 at 14:57

score 1 · Answer 2 · answered Mar 14 '18 at 22:28

1

This wasn't available when you asked, but AVX512F added vcvtudq2ps.

answered Mar 14 '18 at 22:28

ZachB

13,051
4
61
89

Preferably you should quote some of what is at the link so that if the link disappears or changes the answer will still be useful. As it is this is more or less a link only answer. – Michael Petch Mar 14 '18 at 22:45

score 1 · Answer 3 · answered Feb 06 '12 at 09:58

1

This is based on an example from the old but useful Apple AltiVec-SSE migration documentation which unfortunately is now no longer available at http://developer.apple.com:

inline __m128 _mm_ctf_epu32(const __m128i v)
{
    const __m128 two16 = _mm_set1_ps(0x1.0p16f);

    // Avoid double rounding by doing two exact conversions
    // of high and low 16-bit segments
    const __m128i hi = _mm_srli_epi32((__m128i)v, 16);
    const __m128i lo = _mm_srli_epi32(_mm_slli_epi32((__m128i)v, 16), 16);
    const __m128 fHi = _mm_mul_ps(_mm_cvtepi32_ps(hi), two16);
    const __m128 fLo = _mm_cvtepi32_ps(lo);

    // do single rounding according to current rounding mode
    return _mm_add_ps(fHi, fLo);
}

answered Feb 06 '12 at 09:58

Paul R

208,748
37
389
560

This, too, is a great answer. I wonder how it compares to Stephen Canon's solution in terms of accuracy and performance. – zr. Feb 06 '12 at 11:21
The other solution looks good, but the code above has the advantage of having been tested, and furthermore it uses intrinsics, which makes it a little more portable. Probably not much difference when it comes to performance though. – Paul R Feb 06 '12 at 11:31
If you're doing this in a loop, setting `lo` with 2 shifts will be a bottleneck vs. `_mm_and_si128(v, _mm_set_epi32(0x0000FFFF))` (like Stephen's answer). Haswell for example can only do 1 vector shift per clock cycle, and FPU instructions compete for port 0 as well. `_mm_and_si128` can run on port 5, which none of the other instructions use. – Peter Cordes Mar 14 '18 at 22:37

Most efficient way to convert vector of uint32 to vector of float?

3 Answers3

Linked