If you already have an array, then yes, absolutely use _mm256_loadu_si256 (or even the aligned version, _mm256_load_si256, if your array is alignas(32)). But generally don't create an array just to store into / reload from.
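
For example, a minimal sketch of the already-have-an-array case (the function names and the assumption of an AVX2-enabled build are mine, not from the question):

    #include <immintrin.h>
    #include <stdalign.h>
    #include <stdint.h>

    __m256i load_existing_array(const int64_t *p)
    {
        return _mm256_loadu_si256((const __m256i *)p);   // no alignment requirement
    }

    __m256i load_aligned_array(void)
    {
        alignas(32) int64_t arr[4] = {1, 2, 3, 4};
        return _mm256_load_si256((const __m256i *)arr);  // arr is 32-byte aligned
    }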
Use the _mm_set intrinsics and let the compiler decide how to do it. Note that they take their args with the highest-numbered element first, e.g.:

    __m256i vt = _mm256_set_epi64x(rdx, rcx, rbx, rax);
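
As a self-contained sketch (the function name and int64_t parameters are my own choices):

    #include <immintrin.h>
    #include <stdint.h>

    __m256i pack_four_i64(int64_t a, int64_t b, int64_t c, int64_t d)
    {
        // highest-numbered element first: d lands in element 3, a in element 0
        return _mm256_set_epi64x(d, c, b, a);
    }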
You typically don't want the asm to look anything like your scalar store -> vector load C source, because that would produce a store-forwarding stall.
gcc 6.1 "sees through" the local array in this case (and uses 2x vmovq
/ 2x vpinsrq
/ 1x vinserti128
), but it still generates code to align the stack to 32B. (Even though it's not needed because it didn't end up needing any 32B-aligned locals).
As you can see on the Godbolt Compiler Explorer, the actual data-movement part of both ways is the same, but the array way has a bunch of wasted instructions that gcc failed to optimize away after deciding to avoid the bad way that the source was implying.
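
For comparison, here is a sketch of what the array-based source presumably looks like; as written, it implies scalar stores followed by a vector reload, which is the pattern gcc 6.1 manages to see through:

    #include <immintrin.h>
    #include <stdint.h>

    __m256i pack_four_i64_via_array(int64_t a, int64_t b, int64_t c, int64_t d)
    {
        int64_t tmp[4] = {a, b, c, d};                    // scalar stores in the source
        return _mm256_loadu_si256((const __m256i *)tmp);  // vector reload in the source
    }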
_mm256_set_epi64x works in 32-bit code (with gcc at least). You get 2x vmovq and 2x vmovhps, the vmovhps doing 64-bit loads into the high half of an xmm register. (Add -m32 to the compile options in the Godbolt link.)
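
The same kind of function for a 32-bit build (the exact flags here are my assumption, something like gcc -O3 -mavx2 -m32):

    #include <immintrin.h>
    #include <stdint.h>

    __m256i pack_four_i64_i386(int64_t a, int64_t b, int64_t c, int64_t d)
    {
        // In -m32 builds the 64-bit args live in stack memory, so gcc builds the
        // vector with memory-source vmovq / vmovhps instead of vmovq / vpinsrq
        // from GP registers.
        return _mm256_set_epi64x(d, c, b, a);
    }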