6

VS2019, Release, x86.

template <int i> float get() const {
    int f = _mm_extract_ps(fmm, i);
    return (float const&)f;
}

When I use return (float&)f; the compiler generates

extractps m32, ...
movss xmm0, m32

The result is correct.

When I use return (float const&)f; the compiler generates

extractps eax, ...
movd xmm0, eax

The result is wrong.

The main idea is that T& and T const& are both T first, and const second. Const is just an agreement between programmers, and you know you can get around it. There is no const in assembly code, but the float type is there. So I think that for both float& and float const& the value MUST have a float representation (CPU register) in assembly. The compiler can use an intermediate 32-bit integer register, but the final interpretation must be float.

At this point it looks like a regression, because this worked fine before. Using float& here is also definitely strange: we shouldn't have to care about the safety of float const&, whereas a temporary variable for float& is really questionable.

Microsoft answered:

Hi Truthfinder, thanks for the self-contained repro. As it happens, this behavior is actually correct. As my colleague @Xiang Fan [MSFT] described in an internal email:

The conversions performed by [a c-style cast] tries the following sequence: (4.1) — a const_cast (7.6.1.11), (4.2) — a static_cast (7.6.1.9), (4.3) — a static_cast followed by a const_cast, (4.4) — a reinterpret_cast (7.6.1.10), or (4.5) — a reinterpret_cast followed by a const_cast,

If a conversion can be interpreted in more than one of the ways listed above, the interpretation that appears first in the list is used.

So in your case, (const float &) is converted to static_cast, which has the effect "the initializer expression is implicitly converted to a prvalue of type “cv1 T1”. The temporary materialization conversion is applied and the reference is bound to the result."

But in the other case, (float &) is converted to reinterpret_cast because static_cast isn’t valid, which is the same as *reinterpret_cast<float*>(&operand).

The actual "bug" you're observing is that one cast does: "transform the float-typed value "1.0" into the equivalent int-typed value "1"", while the other cast says "find the bit representation of 1.0 as a float, and then interpret those bits as an int".

For this reason we recommend against c-style casts.

Thanks!

MS forum link: https://developercommunity.visualstudio.com/content/problem/411552/extract-ps-intrinsics-bug.html

Any ideas?

P.S. What do I really want:

float val = _mm_extract_ps(xmm, 3);

In manual assembly I can write: extractps val, xmm0, 3 where val is a 32-bit float variable in memory. Only ONE instruction! I want to see the same result in compiler-generated assembly code. No shuffles or any other excessive instructions. The worst acceptable case is: extractps reg32, xmm0, 3; mov val, reg32.

My point about T& and T const&: The type of variable must be the SAME for both cases. But now float& will interpret m32 as float32 and float const& will interpret m32 as int32.

#include <stdio.h>

int main() {
    int z = 1;
    float x = (float&)z;        // reinterprets the bits of z as a float
    float y = (float const&)z;  // converts the value 1 to 1.0f
    printf("%f %f %i", x, y, x==y);
    return 0;
}

Out: 0.000000 1.000000 0

Is that really OK?

Best regards, Truthfinder

Peter Cordes
  • If 4.2 is used for the const-ref, why wouldn't 4.3 be used for the mutable-ref? – JVApen Jan 31 '19 at 07:22
  • PS: I agree with the C-style cast remark, don't use it, it only gives you bugs. Why ain't you casting to a bare float, without the reference? – JVApen Jan 31 '19 at 07:23
  • Please let me not to agree. Because it's C++ at first. C-style cast is not a good practice for C++. C-style in C++ is a crutch again. – truthfinder Jan 31 '19 at 07:43
  • In your real code, does the resulting `float` need to get stored to memory instead of being in an xmm reg where you can use it? You wrote a function that returns a float, instead of storing into a `float*`, so any asm you look at from that will obviously finish with the result in a register. Storing a float to memory is one of the few uses for `extractps`. And yes, maybe on some future CPU it will be only 1 uop total, and thus better than `shufps` + `movss` for a memory dst. gcc/clang will use extractps for the code in my answer with `*out=...` https://godbolt.org/z/oLgBd4 but not MSVC – Peter Cordes Jan 31 '19 at 09:56
  • And BTW, `extractps reg32, xmm0, 3` / `mov val, reg32` is 3 uops including a shuffle as part of extractps, far worse than `shufps` / `movss` (especially on Bulldozer-family where latency between XMM and scalar integer is high, although if you're just storing then OoO exec can hide it if you don't reload soon). IDK why you think that would be acceptable. – Peter Cordes Jan 31 '19 at 10:10

4 Answers

11

There's an interesting question about C++ cast semantics (which Microsoft already briefly answered for you), but it's mixed up with your misuse of _mm_extract_ps resulting in needing a type-pun in the first place. (And only showing asm that is equivalent, omitting the int->float conversion.) If someone else wants to expand on the standard-ese in another answer, that would be great.

TL:DR: use this instead: it's zero or one shufps. No extractps, no type punning.

template <int i> float get(__m128 input) {
    __m128 tmp = input;
    if (i)     // constexpr i means this branch is compile-time-only
        tmp = _mm_shuffle_ps(tmp,tmp,i);  // shuffle it to the bottom.
    return _mm_cvtss_f32(tmp);
}

If you actually have a memory-destination use case, you should be looking at asm for a function that takes a float* output arg, not a function that needs the result in xmm0. (And yes, that is a use-case for the extractps instruction, but arguably not the _mm_extract_ps intrinsics. gcc and clang use extractps when optimizing *out = get<2>(in), although MSVC misses that and still uses shufps + movss.)
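A minimal sketch of such a memory-destination variant, for looking at the asm of the store case. The helper name store_elem is hypothetical (it is not from the question), and it uses only SSE1 intrinsics. Whether a given compiler folds the shuffle and store into a single extractps depends on the compiler and on SSE4.1 being enabled.

```cpp
#include <immintrin.h>

// Hypothetical helper: write element i of v to *out.
template <int i> void store_elem(float* out, __m128 v) {
    static_assert(i >= 0 && i < 4, "element index out of range");
    __m128 tmp = v;
    if (i)                                  // i == 0: element is already in the low position
        tmp = _mm_shuffle_ps(tmp, tmp, i);  // low 2 bits of the immediate pick the source of element 0
    _mm_store_ss(out, tmp);                 // store only the low float
}
```

Calling store_elem<2>(&val, vec) writes element 2 of vec into val; comparing the generated asm for this against the register-returning version shows what each compiler does for the memory case.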


Both blocks of asm you show are simply copying the low 32 bits of xmm0 somewhere, with no conversion to int. You left out the important difference, and only showed the part that just uselessly copies the float bit-pattern out of xmm0 and then back, in 2 different ways (to register or to memory). movd is a pure copy of the bits unmodified, just like the movss load.

It's the compiler's choice which to use, after you force it to use extractps at all. Going through a register and back is lower latency than store/reload, but more ALU uops.

The (float const&) attempt to type-pun does include a conversion from FP to integer, which you didn't show. As if we needed any more reason to avoid pointer/reference casting for type-punning, this really does mean something different: (float const&)f takes the integer bit-pattern (from _mm_extract_ps) as an int and converts that to float.

I put your code on the Godbolt compiler explorer to see what you left out.

float get1_with_extractps_const(__m128 fmm) {
    int f = _mm_extract_ps(fmm, 1);
    return (float const&)f;
}

;; from MSVC -O2 -Gv  (vectorcall passes __m128 in xmm0)
float get1_with_extractps_const(__m128) PROC   ; get1_with_extractps_const, COMDAT
    extractps eax, xmm0, 1   ; copy the bit-pattern to eax

    movd    xmm0, eax      ; these 2 insns are an alternative to pxor xmm0,xmm0 + cvtsi2ss xmm0,eax to avoid false deps and zero the upper elements
    cvtdq2ps xmm0, xmm0    ; packed conversion is 1 uop
    ret     0

GCC compiles it this way:

get1_with_extractps_const(float __vector(4)):    # gcc8.2 -O3 -msse4
        extractps       eax, xmm0, 1
        pxor    xmm0, xmm0            ; cvtsi2ss has an output dependency so gcc always does this
        cvtsi2ss        xmm0, eax     ; MSVC's way is probably better for float.
        ret

Apparently MSVC does define the behaviour of pointer/reference casting for type-punning. Plain ISO C++ doesn't (strict aliasing UB), and neither do other compilers. Use memcpy to type-pun, or a union (which GNU C and MSVC support in C++ as an extension). Of course in this case, type-punning the vector element you want to an integer and back is horrible.
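As a rough illustration of the memcpy route, here is a minimal sketch. The helper name float_from_bits is hypothetical, and the sketch assumes float is a 32-bit IEEE-754 type, as it is on x86.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret the bits of a 32-bit integer as a float.
inline float float_from_bits(std::int32_t bits) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "float must be 32 bits");
    float f;
    std::memcpy(&f, &bits, sizeof f);  // copies the object representation; no int->float conversion
    return f;
}
```

With those assumptions, float_from_bits(0x3f800000) yields 1.0f, while float_from_bits(1) yields a tiny denormal rather than 1.0f, which is the same distinction the two casts in the question make.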

Only for (float &)f does gcc warn about the strict-aliasing violation. And GCC / clang agree with MSVC that only this version is a type-pun, not materializing a float from an implicit conversion. C++ is weird!

float get1_with_extractps_nonconst(__m128 fmm) {
    int f = _mm_extract_ps(fmm, 1);
    return (float &)f;
}

<source>: In function 'float get_with_extractps_nonconst(__m128)':
<source>:21:21: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
     return (float &)f;
                     ^

gcc optimizes away the extractps altogether.

# gcc8.2 -O3 -msse4
get1_with_extractps_nonconst(float __vector(4)):
    shufps  xmm0, xmm0, 85    ; 0x55 = broadcast element 1 to all elements
    ret

Clang uses SSE3 movshdup to copy element 1 to 0. (And element 3 to 2). But MSVC doesn't, which is another reason to never use this:

float get1_with_extractps_nonconst(__m128) PROC
    extractps DWORD PTR f$[rsp], xmm0, 1     ; store
    movss   xmm0, DWORD PTR f$[rsp]          ; reload
    ret     0

Don't use _mm_extract_ps for this

Both of your versions are horrible because this is not what _mm_extract_ps or extractps are for. See "Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?".

A float in a register is the same thing as the low element of a vector. The high elements don't need to be zeroed. And if they did, you'd want to use insertps which can do xmm,xmm and zero elements according to an immediate.

Use _mm_shuffle_ps to bring the element you want to the low position of a register, and then it is a scalar float. (And you can tell a C++ compiler that with _mm_cvtss_f32). This should compile to just shufps xmm0,xmm0,2, without an extractps or any mov.

template <int i> float get() const {
    __m128 tmp = fmm;
    if (i)                               // i=0 means the element is already in place
        tmp = _mm_shuffle_ps(tmp,tmp,i);  // else shuffle it to the bottom.
    return _mm_cvtss_f32(tmp);
}

(I skipped using _MM_SHUFFLE(0,0,0,i) because that's equal to i.)
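The identity behind that shortcut can be checked at compile time. The following is a small sketch, with the hypothetical name get_elem, that spells out the macro form and confirms it matches the bare index.

```cpp
#include <immintrin.h>

// _MM_SHUFFLE(z, y, x, w) packs four 2-bit selectors as (z<<6)|(y<<4)|(x<<2)|w,
// so with the upper three selectors at 0 only the low field remains.
static_assert(_MM_SHUFFLE(0, 0, 0, 0) == 0, "selector 0");
static_assert(_MM_SHUFFLE(0, 0, 0, 1) == 1, "selector 1");
static_assert(_MM_SHUFFLE(0, 0, 0, 2) == 2, "selector 2");
static_assert(_MM_SHUFFLE(0, 0, 0, 3) == 3, "selector 3");

// Hypothetical free-function form that uses the macro explicitly.
template <int i> float get_elem(__m128 v) {
    static_assert(i >= 0 && i < 4, "element index out of range");
    // Element 0 of the result comes from v[i]; the other elements are ignored.
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, i)));
}
```

This version always emits the shuffle in the source, even for i == 0, so it is meant only to illustrate the immediate encoding, not to replace the if (i) form above.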

If your fmm was in memory, not a register, then hopefully compilers would optimize away the shuffle and just movss xmm0, [mem]. MSVC 19.14 does manage to do that, at least for the function-arg on the stack case. I didn't test other compilers, but clang should probably manage to optimize away the _mm_shuffle_ps; it's very good at seeing through shuffles.

Test-case proving this compiles efficiently

e.g. a test-case with a non-class-member version of your function, and a caller that inlines it for a specific i:

#include <immintrin.h>

template <int i> float get(__m128 input) {
    __m128 tmp = input;
    if (i)                  // i=0 means the element is already in place
        tmp = _mm_shuffle_ps(tmp,tmp,i);  // else shuffle it to the bottom.
    return _mm_cvtss_f32(tmp);
}

// MSVC -Gv (vectorcall) passes arg in xmm0
// With plain dumb x64 fastcall, arg is on the stack, and it *does* just MOVSS load without shuffling
float get2(__m128 in) {
    return get<2>(in);
}

From the Godbolt compiler explorer, asm output from MSVC, clang, and gcc:

;; MSVC -O2 -Gv
float get<2>(__m128) PROC               ; get<2>, COMDAT
        shufps  xmm0, xmm0, 2
        ret     0
float get<2>(__m128) ENDP               ; get<2>

;; MSVC -O2  (without Gv, so the vector comes from memory)
input$ = 8
float get<2>(__m128) PROC               ; get<2>, COMDAT
        movss   xmm0, DWORD PTR [rcx+8]
        ret     0
float get<2>(__m128) ENDP               ; get<2>

# gcc8.2 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
        shufps  xmm0, xmm0, 2   # with -msse4, we get unpckhps
        ret

# clang7.0 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
        unpckhpd        xmm0, xmm0      # xmm0 = xmm0[1,1]
        ret

clang's shuffle optimizer simplifies to unpckhpd, which is faster on some old CPUs. Unfortunately it didn't notice it could have used movhlps xmm0,xmm0, which is also fast and 1 byte shorter.

Peter Cordes
  • No, that's definitely not the case. In assembly language I can extract any f32 float from an xmm register by index without a shuffle. So please do not recommend that I add additional excessive instructions to C++. This is definitely not a good solution. It looks like a crutch to work around an issue which is the result of a compiler bug. – truthfinder Jan 31 '19 at 07:29
  • Why should I pay 1 CPU clock on PORT5 per iteration in time critical code due to C++ compiler issues? – truthfinder Jan 31 '19 at 07:47
  • @Peter did you read Microsoft's explanation? It explains how the C-style cast behaves entirely differently in the two cases. Type punning takes place in only one of the cases. There is no question really, there is no bug. – rustyx Jan 31 '19 at 07:54
  • @rustyx: I read the explanation, but it doesn't make any sense if the asm the question quotes is correct. Both of them use `extractps`, just with different destinations and a `movd` or `movss` back into xmm0 – Peter Cordes Jan 31 '19 at 07:56
  • @truthfinder: `extractps` is slower than `shufps` on Skylake; it costs 2 uops and one of them needs port 5 (https://agner.org/optimize/). And then to get the value back into an XMM register, you need a 3rd uop. My version is 1 uop *total*, not involving `extractps`. Did you even read [Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?](//stackoverflow.com/q/5526658) It's not the instruction you want to get a vector element as a scalar float. – Peter Cordes Jan 31 '19 at 08:00
  • @PeterCordes Yes, but there is no final answer there. The 2 uops may change in the future. At this point my code with psrldq/movss is faster than all the cases with extractps and shufps. But the ideological problem with T& and T const& is still there. The type of variable must be the SAME for both cases. But now float& will interpret m32 as float32 and float const& will interpret m32 as int32. Is that really OK? – truthfinder Jan 31 '19 at 08:17
  • @rustyx: just tried the source on Godbolt, yes one version does include taking the bit-pattern as an integer and converting that to `float`. The OP confusingly left that out of the question! Updated my answer. – Peter Cordes Jan 31 '19 at 09:01
  • @truthfinder: extracting a vector element as a scalar needs at most 1 instruction, but using `extractps` at all means you need at least 2. This is totally separate from the question of what the C++ reference casts mean. Maybe someone else wants to walk you through Microsoft's explanation of the C++ language rules (which lead to really surprising behaviour in this case, but gcc/clang agree with MSVC here). You seem to be missing the point of my answer re: efficiency. – Peter Cordes Jan 31 '19 at 09:09
  • @Peter Ok, forget about SSE. int main() { int z = 1; float x = (float&)z; float y = (float const&)z; printf("%f %f %i", x, y, x==y); } Try it on different compilers. – truthfinder Jan 31 '19 at 09:40
  • if x != y is ok in this situation, then no questions to _mm_extract_ps too. – truthfinder Jan 31 '19 at 09:49
  • @truthfinder: Yes, we know from this test-case that those two casts mean different things. One of them type-puns to `float`, the other is equivalent to `(float)z`. Microsoft already explained why, with reference to the standard: an extra implicit conversion from int to float is part of the candidate cast that matches first. – Peter Cordes Jan 31 '19 at 09:51
8

My point about T& and T const&: The type of variable must be the SAME for both cases.

As Microsoft's support tried to explain, no these are NOT the same. It's how C++ works.

You are using a C-style cast ( ... ), which in C++ breaks down into a series of attempts to use different C++ casts in decreasing order of safety:

  • (4.1) — a const_cast
  • (4.2) — a static_cast
  • (4.3) — a static_cast followed by a const_cast
  • (4.4) — a reinterpret_cast
  • (4.5) — a reinterpret_cast followed by a const_cast

In the case of (float const&) b (where b is an int):

  • We try const_cast<float const&>(b); - no luck (float vs. int)
  • We try static_cast<float const&>(b); - voila! (after an implicit standard conversion of b to a temporary float - remember that C++ allows itself to perform two standard and one user-defined conversions per expression implicitly)

In the case of (float&) b (again where b is an int):

  • We try const_cast<float&>(b); - no luck
  • We try static_cast<float&>(b); - no luck (after an implicit standard conversion of b to a temporary float, it won't bind to a non-const lvalue reference)
  • We try const_cast<float&>(static_cast<float&>(b)); - no luck
  • We try reinterpret_cast<float&>(b); - voila!

Strict aliasing rule aside (see footnote 1), here's an example that demonstrates this behavior:

#include <iostream>

int main() {
    float a = 1.2345f;
    int b = reinterpret_cast<int&>(a); // this type-pun is built into _mm_extract_ps
    float nc = (float&)b;
    float cc = (float const&)b;
    float rc = reinterpret_cast<float&>(b);
    float sc = static_cast<float const&>(b);
    std::cout << "a=" << a << " b=" << b << std::endl;
    std::cout << "nc=" << nc << " cc=" << cc << std::endl;
    std::cout << "rc=" << rc << " sc=" << sc << std::endl;
}

Prints:

a=1.2345 b=1067320345
nc=1.2345 cc=1.06732e+09
rc=1.2345 sc=1.06732e+09


That's why you should not use C-style casts in C++. Less typing, but much more headache.

Also, don't use _mm_extract_ps. It returns an int because the extractps instruction copies the float's bit pattern to a general-purpose register (or to memory). That is not what you want here, since the value must be copied back to an XMM register before it can be used as a float, which is a waste of time. As Peter Cordes explains, use _mm_cvtss_f32(_mm_shuffle_ps()) instead, which compiles to a single instruction.


1 Technically speaking, using reinterpret_cast to circumvent the C++ type system (a.k.a. type punning) is undefined behavior in ISO C++. However, MSVC relaxes this rule as a compiler extension. So the code is correct, as long as it's compiled with MSVC or elsewhere where the strict aliasing rule can be turned off (e.g. -fno-strict-aliasing). The standard way to type-pun without falling into the strict aliasing trap is through memcpy().

rustyx
  • Probably worth mentioning that `(float&)b;` and the reinterpret casts are undefined behaviour in ISO C++ (strict aliasing violation). MSVC chooses to define that behaviour so you can write that instead of `memcpy`, but other x86 compilers don't. GCC warns about the strict-aliasing violation. (Unless I and GCC are mistaken here, and ISO C++ *does* let us use `reinterpret_cast` with reference types this way to safely type-pun non-pointer objects, instead of just to change pointer types. That would be nice, but I don't think it's the case.) – Peter Cordes Jan 31 '19 at 10:07
  • If you care about other compilers, just use `memcpy(&b, &a, sizeof(b))`. Modern compilers know how to fully inline that and optimize it the same as a cast. Or use a union; MSVC and gcc/clang/icc all support union type-punning as an extension to C++, like in ISO C99. Potentially slowing down other parts of your code with `-fno-strict-aliasing` doesn't seem worth it. (With reloads of things inside loops, and/or stopping auto-vectorization, unless you use `float *__restrict foo` to guarantee no aliasing.) – Peter Cordes Jan 31 '19 at 10:19
  • Yep, that's probably better. Anyway just tried to address OP's confusion about the casts. As you explained the choice of `_mm_extract_ps` was suboptimal to begin with. – rustyx Jan 31 '19 at 10:58
  • Oh yeah, this answer very nicely explains what's going on. But like my answer for `_mm_extract_ps`, you can explain what it does and why, *and* recommend against it. :P – Peter Cordes Jan 31 '19 at 11:01
0

I see somebody likes to downvote. It looks like I was almost right about *(float*)&. But the better way, of course, is to use the standard intrin.h solution for cross-compilation. From MSVS, smmintrin.h:

#define _MM_EXTRACT_FLOAT(dest, src, ndx) \
        *((int*)&(dest)) = _mm_extract_ps((src), (ndx))

As you can see, there is an official macro for this purpose. It may be different on other platforms, of course. I'm still wondering why Intel chose such a solution, but that's another question anyway.

  • Interesting, GCC's `smmintrin.h` (included by `immintrin.h`) also defines `_MM_EXTRACT_FLOAT(D, S, N)`. It uses `{ (D) = __builtin_ia32_vec_ext_v4sf ((__v4sf)(S), (N)); }`, so definitely only ever use the macro, don't copy the definition. BTW, Intel's intrinsics guide doesn't document it, though. – Peter Cordes Feb 05 '19 at 01:44
  • @PeterCordes Did you find any other common macro in Intel's intrinsics guide? _MM_SHUFFLE for example. – truthfinder Feb 06 '19 at 09:56
-2

OK. It sounds like the idea that float val = _mm_extract_ps(xmm, 3) could be compiled to the single instruction extractps val, xmm0, 3 is not achievable.

And I still use *(float*)&intval because it works predictably on any MSVC version.

As for int _mm_extract_ps, it is definitely bad design. The _ps suffix is used for float types and epi32 for int32 types. The extractps instruction is not typed, so there should be two different functions: int _mm_extract_epi32(__m128i(), 3) and float _mm_extract_ps(__m128(), 3).

P.S. http://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/

I don't know why the language committee or anybody else settled on this solution, but memcpy is just not beautiful. I'm also sure it creates additional problems for the compiler, and there is no way to get a single-instruction result. As I understand it, the recommended solution is int i = _mm_extract_ps(...); float f; std::memcpy(&f, &i, sizeof(f));. To me, float f = static_cast<float const&>(_mm_extract_ps(...)); is simpler, shorter and clearer: a reference because the function returns a value, not a pointer, and const because you can't change it. It looks like the intuitive solution. Const only matters to the compiler; there is no const instruction in the final assembly.

  • `*(float*)&intval` is not portable across compilers. It's only safe with MSVC. On others, it's a strict-aliasing violation. Use memcpy (ISO C++), or a union (ISO C99, and MSVC++ and compilers that support GNU extensions, i.e. gcc/clang/icc). – Peter Cordes Jan 31 '19 at 12:39
  • Nice blog link, though. Good article. – Peter Cordes Jan 31 '19 at 12:49
  • [C99 onward allows you to use `union` for type punning](https://stackoverflow.com/q/11639947/995714) but C++ doesn't allow that. The standard way in C++ would be `memcpy` [Opinions on type-punning in C++?](https://stackoverflow.com/q/346622/995714), [What's a proper way of type-punning a float to an int and vice-versa?](https://stackoverflow.com/q/17789928/995714), [Floating point number to 32 and 64bit binary representation](https://stackoverflow.com/q/52613628/995714) – phuclv Jan 31 '19 at 16:01
  • `extractps r/m32, xmm, imm8` is the FP version of `pextrd r/m32, xmm, imm8` for XMM -> general-purpose register / memory. If you have a `__m128i`, use `_mm_extract_epi32` pextrd. But yes the extractps intrinsic is badly designed; they should have exposed the memory-operand version with something like `void _mm_extract_store_ps(float*, __m128, int)`. because that's hard to express with the existing intrinsic. The register-destination version is maybe useful for a case where you want to do something to the float bit-pattern with integer stuff, like for a scalar `exp()` on the selected element. – Peter Cordes Feb 01 '19 at 03:03
  • @PeterCordes that's Intel's issue. I don't understand why they made both extractps and pextrd. I think it's just 1 instruction. I don't think they set different flags or anything else. I can use for example _mm_extract_ps(__m128i imm, int i) { return _mm_extract_epi32(_mm_castps_si128(imm), i); } actually with the same result. From Intel's manual, extractps: DEST[31:0]=(SRC[127:0] >> (SRC_OFFSET*32)) AND 0FFFFFFFFh and pextrd: TEMP=(Src >> SEL*32) AND FFFF_FFFFH; DEST = TEMP; It can be called `please find the difference in practice`. – truthfinder Feb 01 '19 at 04:57
  • What do you mean "there is no way for single instruction result" with `memcpy`? Have you looked at optimized compiler output? But yes, I really wish `reinterpret_cast<float>(my_int)` would Just Work, because memcpy is ugly unless you wrap it with an inline function; doesn't seem like it would be a hard language feature to include. Especially if it would error at compile time in cases where the types weren't the same size or the destination type was not trivially copyable. – Peter Cordes Feb 01 '19 at 04:57
  • I tried to ask Intel about bad design of _mm_extract_ps, but they just don't want to talk. I guess they just don't care. – truthfinder Feb 01 '19 at 04:58
  • @truthfinder: I think `extractps` exists for the same reason that both `movaps` and `movdqa` exist. Or `andps` vs. `pand`: vec-int vs. vec-FP domain crossing. Some CPUs have extra bypass latency when the result of a `ps` instruction is used as an input to a packed-integer instruction. Nehalem was very much like this, with 2-cycle extra latency. Nehalem was being designed while SSE4.1 was introduced (with Penryn, the generation before). Sandybridge-family removed a lot of that, especially for shuffle instructions, but that's why there are vec-int and FP versions of many operations. – Peter Cordes Feb 01 '19 at 05:00
  • Intel's asm instruction design is usually pretty decent, but their intrinsics are often dumb. e.g. there are few clean ways to express `pmovzxbd xmm0, dword [rdi]`. Many compilers won't optimize away a separate `movd` load from using narrow `_mm_set` or `_mm_load` to create a `__m128i` which `_mm_cvtepu8_epi32` needs as input. And converting from scalar float to `__m128` is dumb: you often can't stop the compiler from emitting instructions to zero the upper elements, because there's no cast intrinsic that allows garbage in the upper elements. Only clang reliably optimizes it away. – Peter Cordes Feb 01 '19 at 05:06
  • @PeterCordes I found that yes, single instruction is possible. Even through memcpy. But beautifulness of memcpy is still questionable. It's a synthetic sample, but with appropriate result: https://godbolt.org/z/OLURD4 – truthfinder Feb 01 '19 at 06:14
  • Nobody is claiming that `memcpy` is beautiful. It's the only *fully* portable option (although union type punning is supported in C++ on most compilers people care about). Either way you can write a `static inline float pun_to_float(int)` wrapper for it, or maybe even a templated `T1 pun(T2)` and hide the mess inside a portable wrapper that you implement however you want. Still more cumbersome than if C++ would just provide type punning as a language built-in, but they don't. – Peter Cordes Feb 01 '19 at 06:20
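
One possible shape for the templated wrapper mentioned in the last comment, as a sketch only. The name pun and its parameter order are assumptions, not an existing API. The static_asserts give the compile-time errors wished for earlier in the thread (size mismatch, non-trivially-copyable types). C++20 later added std::bit_cast in <bit>, which serves the same purpose.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical generic type-pun wrapper built on memcpy.
template <typename To, typename From>
inline To pun(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "pun: types must have the same size");
    static_assert(std::is_trivially_copyable<To>::value, "pun: destination must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "pun: source must be trivially copyable");
    To to;                               // requires To to be default-constructible
    std::memcpy(&to, &from, sizeof to);  // copy the object representation unchanged
    return to;
}
```

Usage would look like float f = pun<float>(bits); for a 32-bit integer bits. Optimizing compilers typically inline the memcpy, so the wrapper itself should not add instructions, though that is worth confirming in the generated asm for the compiler in use.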