There's an interesting question about C++ cast semantics here (which Microsoft already briefly answered for you), but it's mixed up with your misuse of _mm_extract_ps, which is what creates the need for a type-pun in the first place. (And the asm you showed covers only the part that's the same in both versions, omitting the int->float conversion.) If someone else wants to expand on the standard-ese in another answer, that would be great.
TL:DR: use this instead: it compiles to zero or one shufps. No extractps, no type punning.
template <int i> float get(__m128 input) {
__m128 tmp = input;
if (i) // constexpr i means this branch is compile-time-only
tmp = _mm_shuffle_ps(tmp,tmp,i); // shuffle it to the bottom.
return _mm_cvtss_f32(tmp);
}
If you actually have a memory-destination use case, you should be looking at asm for a function that takes a float* output arg, not a function that needs the result in xmm0. (And yes, that is a use-case for the extractps instruction, but arguably not for the _mm_extract_ps intrinsic. gcc and clang do use extractps when optimizing *out = get<2>(in), although MSVC misses that optimization and still uses shufps + movss.)
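For example, a minimal sketch of that kind of function (the name store_get2 is my own, and it relies on the get<> template above):

// With SSE4.1 enabled, gcc and clang compile this store to a single
// extractps with a memory destination (e.g. extractps [rdi], xmm0, 2
// on x86-64 System V); MSVC still emits shufps + movss.
void store_get2(float *out, __m128 in) {
    *out = get<2>(in);
}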
Both blocks of asm you show are simply copying the low 32 bits of xmm0 somewhere, with no conversion to int. You left out the important difference, and only showed the part that just uselessly copies the float bit-pattern out of xmm0 and then back, in 2 different ways (via an integer register or via memory). movd is a pure copy of the bits, unmodified, just like the movss load. It's the compiler's choice which to use, once you force it to use extractps at all. Going through an integer register and back is lower latency than a store/reload, but costs more ALU uops.
The (float const&) attempt to type-pun does include a conversion from integer to FP, which you didn't show. As if we needed any more reason to avoid pointer/reference casting for type-punning, this really does mean something different: (float const&)f takes the integer bit-pattern (from _mm_extract_ps) as an int and converts that value to float, binding the reference to a temporary.
I put your code on the Godbolt compiler explorer to see what you left out.
float get1_with_extractps_const(__m128 fmm) {
int f = _mm_extract_ps(fmm, 1);
return (float const&)f;
}
;; from MSVC -O2 -Gv (vectorcall passes __m128 in xmm0)
float get1_with_extractps_const(__m128) PROC ; get1_with_extractps_const, COMDAT
extractps eax, xmm0, 1 ; copy the bit-pattern to eax
movd xmm0, eax ; these 2 insns are an alternative to pxor xmm0,xmm0 + cvtsi2ss xmm0,eax to avoid false deps and zero the upper elements
cvtdq2ps xmm0, xmm0 ; packed conversion is 1 uop
ret 0
GCC compiles it this way:
get1_with_extractps_const(float __vector(4)): # gcc8.2 -O3 -msse4
extractps eax, xmm0, 1
pxor xmm0, xmm0 ; cvtsi2ss has an output dependency so gcc always does this
cvtsi2ss xmm0, eax ; MSVC's way is probably better for float.
ret
Apparently MSVC does define the behaviour of pointer/reference casting for type-punning. Plain ISO C++ doesn't (it's a strict-aliasing violation, so UB), and neither do other compilers by default. Use memcpy to type-pun, or a union (which GNU C and MSVC support in C++ as an extension). Of course in this case, type-punning the vector element you want to an integer and back is horrible.
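For completeness, a minimal sketch of both well-behaved alternatives (the helper names are just for illustration); in this case they'd still be the wrong tool, since the whole int round-trip is unnecessary:

#include <cstring>

// ISO C++ way: copy the bytes of the int into a float object.
float int_bits_to_float(int bits) {
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Union type-pun: legal in C, supported as an extension in GNU C++ and MSVC.
float int_bits_to_float_union(int bits) {
    union { int i; float f; } u;
    u.i = bits;
    return u.f;
}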
Only for (float &)f does gcc warn about the strict-aliasing violation. And GCC / clang agree with MSVC that only this non-const version is a type-pun, rather than materializing a float temporary from an implicit conversion. C++ is weird!
float get1_with_extractps_nonconst(__m128 fmm) {
int f = _mm_extract_ps(fmm, 1);
return (float &)f;
}
<source>: In function 'float get_with_extractps_nonconst(__m128)':
<source>:21:21: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
return (float &)f;
^
gcc optimizes away the extractps altogether:
# gcc8.2 -O3 -msse4
get1_with_extractps_nonconst(float __vector(4)):
shufps xmm0, xmm0, 85 ; 0x55 = broadcast element 1 to all elements
ret
Clang uses SSE3 movshdup to copy element 1 to element 0 (and element 3 to element 2). But MSVC doesn't optimize away the extractps store/reload at all, which is another reason to never use this:
float get1_with_extractps_nonconst(__m128) PROC
extractps DWORD PTR f$[rsp], xmm0, 1 ; store
movss xmm0, DWORD PTR f$[rsp] ; reload
ret 0
Don't use _mm_extract_ps for this
Both of your versions are horrible because this is not what _mm_extract_ps or extractps are for. See Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?
A float in a register is the same thing as the low element of a vector; the high elements don't need to be zeroed. And if they did, you'd want to use insertps, which can work register-to-register (xmm,xmm) and zero elements according to an immediate.
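If you actually needed that, here's a hedged sketch (the helper name and this specific immediate are my illustration, not something from the question):

#include <smmintrin.h>   // SSE4.1 for _mm_insert_ps

// insertps imm8: bits [7:6] pick the source element, bits [5:4] pick the
// destination position, bits [3:0] are a zero mask applied afterwards.
// This puts element 2 of v into element 0 and zeroes elements 1..3.
__m128 elem2_low_rest_zeroed(__m128 v) {
    return _mm_insert_ps(v, v, (2 << 6) | (0 << 4) | 0x0E);
}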
Use _mm_shuffle_ps to bring the element you want to the low position of a register, and then it is a scalar float. (And you can tell a C++ compiler that with _mm_cvtss_f32.) This should compile to just shufps xmm0,xmm0,2, without an extractps or any mov.
template <int i> float get() const {
__m128 tmp = fmm;
if (i) // i=0 means the element is already in place
tmp = _mm_shuffle_ps(tmp,tmp,i); // else shuffle it to the bottom.
return _mm_cvtss_f32(tmp);
}
(I skipped using _MM_SHUFFLE(0,0,0,i) because _MM_SHUFFLE(z,y,x,w) expands to (z<<6)|(y<<4)|(x<<2)|w, so with zeros in the top three fields it's just equal to i.)
If your fmm was in memory, not a register, then hopefully compilers would optimize away the shuffle and just do a movss xmm0, [mem] load. MSVC 19.14 does manage that, at least for the function-arg-on-the-stack case. I didn't test other compilers, but clang should probably manage to optimize away the _mm_shuffle_ps too; it's very good at seeing through shuffles.
Test-case proving this compiles efficiently
e.g. a test-case with a non-class-member version of your function, and a caller that inlines it for a specific i:
#include <immintrin.h>
template <int i> float get(__m128 input) {
__m128 tmp = input;
if (i) // i=0 means the element is already in place
tmp = _mm_shuffle_ps(tmp,tmp,i); // else shuffle it to the bottom.
return _mm_cvtss_f32(tmp);
}
// MSVC -Gv (vectorcall) passes arg in xmm0
// With plain dumb x64 fastcall, arg is on the stack, and it *does* just MOVSS load without shuffling
float get2(__m128 in) {
return get<2>(in);
}
From the Godbolt compiler explorer, asm output from MSVC, clang, and gcc:
;; MSVC -O2 -Gv
float get<2>(__m128) PROC ; get<2>, COMDAT
shufps xmm0, xmm0, 2
ret 0
float get<2>(__m128) ENDP ; get<2>
;; MSVC -O2 (without Gv, so the vector comes from memory)
input$ = 8
float get<2>(__m128) PROC ; get<2>, COMDAT
movss xmm0, DWORD PTR [rcx+8]
ret 0
float get<2>(__m128) ENDP ; get<2>
# gcc8.2 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
shufps xmm0, xmm0, 2 # with -msse4, we get unpckhps
ret
# clang7.0 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
unpckhpd xmm0, xmm0 # xmm0 = xmm0[1,1]
ret
clang's shuffle optimizer simplifies this to unpckhpd, which is faster on some old CPUs. Unfortunately it didn't notice that it could have used movhlps xmm0,xmm0, which is also fast and 1 byte shorter.
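If you wanted to steer compilers toward movhlps yourself, a small sketch (the function name is mine, and compilers are still free to pick a different shuffle):

// movhlps xmm0,xmm0 copies the high 64 bits to the low 64 bits, so element 2
// lands in the low position; _mm_cvtss_f32 then reads it as a scalar float.
float get2_movehl(__m128 v) {
    return _mm_cvtss_f32(_mm_movehl_ps(v, v));
}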