2

I have four uint64_t values that I want to combine into a single __m256i, but I'm not sure how to go about it.

Here's one attempt (where rax, rbx, rcx, and rdx are uint64_t):

uint64_t a[4] = {rax, rbx, rcx, rdx};

__m256i t = _mm256_load_si256((__m256i *) &a);
cat
NationWidePants

2 Answers

3

If you already have an array, then yes, absolutely use _mm256_loadu_si256 (or the aligned version, _mm256_load_si256, if your array is alignas(32)). But generally, don't create an array just to store into it and reload from it.
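For the case where the data already lives in an array, a minimal sketch of the unaligned load could look like this. The helper name `roundtrip_unaligned` is hypothetical, and the `target("avx")` attribute assumes GCC or Clang (it avoids needing `-mavx` on the command line; the CPU must still support AVX).

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: load four 64-bit values from an array that has no
   alignment guarantee, then store them back so the result can be inspected.
   The __m256i stays inside this function so no vector crosses a call
   boundary in code compiled without AVX enabled. */
__attribute__((target("avx")))
static void roundtrip_unaligned(const uint64_t in[4], uint64_t out[4])
{
    /* loadu has no alignment requirement, unlike _mm256_load_si256. */
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

Element 0 of the vector comes from `in[0]`, so a memory load keeps the array order unchanged.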


Use the _mm256_set_* intrinsics and let the compiler decide how to do it. Note that they take their arguments with the highest-numbered element first, e.g.

__m256i vt = _mm256_set_epi64x(rdx, rcx, rbx, rax);

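You typically don't want the asm to look anything like your scalar store -> vector load C source, because that would produce a store-forwarding stall: the 32-byte load cannot be forwarded from four separate 8-byte stores that are still in the store buffer.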
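To make the argument order concrete, here is a small sketch that packs four scalars both ways and writes the vector to memory so the element order is visible. The helper names `pack4` and `pack4_r` are hypothetical, and `target("avx")` assumes GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: a becomes element 0 (lowest), d becomes element 3.
   _mm256_set_epi64x takes the highest element first, so the arguments
   are passed in reverse. */
__attribute__((target("avx")))
static void pack4(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
                  uint64_t out[4])
{
    __m256i v = _mm256_set_epi64x((long long)d, (long long)c,
                                  (long long)b, (long long)a);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* Same layout using _mm256_setr_epi64x, which takes element 0 first. */
__attribute__((target("avx")))
static void pack4_r(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
                    uint64_t out[4])
{
    __m256i v = _mm256_setr_epi64x((long long)a, (long long)b,
                                   (long long)c, (long long)d);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

Both helpers leave `a` in `out[0]` and `d` in `out[3]`; which intrinsic to prefer is a matter of which ordering matches your comments and variable names.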

gcc 6.1 "sees through" the local array in this case (and uses 2x vmovq / 2x vpinsrq / 1x vinserti128), but it still generates code to align the stack to 32B, even though that isn't needed because it didn't end up using any 32B-aligned locals.

As you can see on the Godbolt Compiler Explorer, the actual data-movement part of both ways is the same, but the array way has a bunch of wasted instructions that gcc failed to optimize away after deciding to avoid the bad way that the source was implying.

_mm256_set_epi64x also works in 32-bit code (with gcc at least). You get 2x vmovq plus 2x vmovhps, the latter doing 64-bit loads into the upper half of an xmm register. (Add -m32 to the compile options in the Godbolt link.)

Peter Cordes
  • Does anyone even use `_set_`? It seems that `_setr_` is more intuitive. This whole illusion of big-endian at the intrinsic level does nothing but confuse the crap out of all the low-level x86 programmers who eat little-endian for breakfast. – Mysticial Jul 07 '16 at 20:25
  • @Mysticial: Highest element first makes vector left-shifts actually go left, and makes your vectors match the way the insn ref manual describes them and shows them in diagrams. I find this useful when figuring out whether I can replace a shuffle with an `_mm_srli_epi64` or something. Yes it's confusing, but `_setr_` isn't the only place it occurs. I'm glad they provided both intrinsics, so you can choose the one that matches the ordering you use in comments / variable names. – Peter Cordes Jul 07 '16 at 20:47
1

Firstly, make sure your CPU even supports these AVX instructions (see the question "Performing AVX integer operation").
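One way to do that check at run time, as a sketch: GCC and Clang provide the `__builtin_cpu_supports` builtin on x86 targets. The wrapper names `cpu_has_avx` and `cpu_has_avx2` are hypothetical; other compilers need a different mechanism (for example a manual cpuid query).

```c
#include <assert.h>

/* Hypothetical wrappers around the GCC/Clang x86 builtins.
   __builtin_cpu_init() is harmless to call more than once. */
static int cpu_has_avx(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
}

static int cpu_has_avx2(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
}
```

256-bit loads and stores such as `_mm256_load_si256` need only AVX, while most 256-bit integer arithmetic needs AVX2, so it can be worth checking both.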

Secondly, according to https://software.intel.com/en-us/node/514151, the pointer argument of _mm256_load_si256 must be an aligned location. The address of an ordinary stack variable depends on the sizes of the stack frames from previous calls, so it is not guaranteed to be 32-byte aligned.

Instead, just use the intrinsic type __m256i to force the compiler to align it; or, according to https://software.intel.com/en-us/node/582952, use __declspec(align(32)) on your array a.
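As a portable alternative to `__declspec(align(32))`, which is compiler-specific, C11 offers `alignas`. The sketch below assumes a C11 compiler and GCC or Clang for the `target("avx")` attribute; the helper name `roundtrip_aligned` is hypothetical.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helper: build a 32-byte-aligned scratch array, load it with
   the aligned-load intrinsic, and store the vector to out[] for inspection. */
__attribute__((target("avx")))
static void roundtrip_aligned(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
                              uint64_t out[4])
{
    /* alignas(32) satisfies the alignment requirement of _mm256_load_si256. */
    alignas(32) uint64_t tmp[4] = {a, b, c, d};
    __m256i v = _mm256_load_si256((const __m256i *)tmp);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

This only makes the aligned load valid; it is still a scalar-store / vector-reload pattern, so it is mainly useful when the array has to exist for other reasons.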

Community
  • I wrote a cpuid check before it, so I know what's supported, just data type issues. I'll try your suggestion. – NationWidePants Jul 03 '16 at 00:54
  • You shouldn't use a scratch array at all; `_mm256_set_epi64x` produces better code. – Peter Cordes Jul 06 '16 at 15:12
  • @PeterCordes to clarify why I used a "scratch array": this was written for cython. I didn't have much of a choice if I wanted a good means by which to interface with the python side. – NationWidePants Apr 09 '21 at 14:20