
I'd like to load a 128-bit register with non-contiguous 32-bit floats. Specifically, the floats are spaced 128 bits (16 bytes) apart in memory.

So if memory looks like this:

| Float 0  | Float X | Float X | Float X |
| Float 4  | Float X | Float X | Float X |
| Float 8  | Float X | Float X | Float X |
| Float 12 | Float X | Float X | Float X |

I'd like to load a vector like this:

| Float 0  | Float 4 | Float 8 | Float 12 |
asked by PinkPR (edited by Peter Cordes)
Comments:
  • Hopefully you can vectorize your code such that you also use the other floats. Then just load all of them and do some shuffling around. – Jester Mar 17 '16 at 16:55
  • On AVX2-capable machines you can `VGATHER[D/Q]PS`, but that may be slower than loading the values one by one. – EOF Mar 17 '16 at 17:02
  • Do you need the other columns too? Consider transposing if you do. – harold Mar 17 '16 at 17:34

2 Answers


Hopefully you're going to use the other data for something, in which case loading everything and doing a transpose is more likely to be useful.
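
For example (an intrinsics sketch, not part of the original answer; the function name is mine, and it assumes all 16 floats of the 4x4 block are readable and wanted), SSE's `_MM_TRANSPOSE4_PS` macro turns four row loads into four column vectors:

#include <xmmintrin.h>

/* Sketch: load a 4x4 block of floats and transpose it, so that r0 holds
 * column 0 = { f0 f4 f8 f12 }, r1 holds column 1, and so on. */
static inline void load_columns_4x4(const float *p,
                                    __m128 *r0, __m128 *r1,
                                    __m128 *r2, __m128 *r3)
{
    __m128 a = _mm_loadu_ps(p);        /* row 0: f0  f1  f2  f3  */
    __m128 b = _mm_loadu_ps(p + 4);    /* row 1: f4  f5  f6  f7  */
    __m128 c = _mm_loadu_ps(p + 8);    /* row 2: f8  f9  f10 f11 */
    __m128 d = _mm_loadu_ps(p + 12);   /* row 3: f12 f13 f14 f15 */
    _MM_TRANSPOSE4_PS(a, b, c, d);     /* in-place 4x4 transpose macro */
    *r0 = a; *r1 = b; *r2 = c; *r3 = d;
}

Column 0 is then exactly the { f0 f4 f8 f12 } vector the question asks for, and the other three columns come along for free.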

If not, then using SIMD at all is only viable if there's quite a bit of work to do once the data is in vectors, because packing it into vectors is expensive.


`movss` / `insertps` as shown in @zx485's answer is the "normal" way, like you'd probably get from a compiler if you used `_mm_set_ps(f[12], f[8], f[4], f[0]);`


When your stride is exactly 4 floats, AVX lets you cover all four elements with just two 256-bit loads and a blend.

(related: What's the fastest stride-3 gather instruction sequence? Or for stride 2, it's more obviously worth doing vector loads and shuffling.)

vmovups   ymm1, [float0]                        ; float0 and float4 in the low element of low/high lanes
vblendps  ymm1, ymm1, [float8 - 4], 0b00100010  ;  { x x f12 f4 | x x f8 f0 }

This isn't great because you're likely to cross a cache-line boundary with one of the loads. You could achieve something similar with a `vshufps ymm0, ymm1, [float8], 0b???????` for the 2nd load.

This might be good depending on surrounding code, especially if you have AVX2 for `vpermps` (with a shuffle-control vector constant) or `vpermpd` (with an immediate) for a lane-crossing shuffle to put the elements you want into the low 128b lane.

Without AVX2 for a cross-lane shuffle, you'd need `vextractf128` and then `shufps`. This might require some planning ahead so the elements land in positions from which `shufps` can move them into the right place.


This all works with intrinsics, of course, but they take a lot more typing.
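
With intrinsics it could look something like this sketch (the function names and the assumption that two floats past f12 are safely readable are mine, not part of the answer; the second load starts 4 bytes before f8, so it also touches f13/f14):

#include <immintrin.h>

/* Sketch of the two-load + blend idea above. Assumes p[0..14] are readable. */
static inline __m128 load_stride4_avx2(const float *p)
{
    __m256 v0 = _mm256_loadu_ps(p);      /* { f0 f1 f2 f3 | f4 f5 f6 f7 } (low to high) */
    __m256 v1 = _mm256_loadu_ps(p + 7);  /* { f7 f8 f9 f10 | f11 f12 f13 f14 }          */
    __m256 b  = _mm256_blend_ps(v0, v1, 0x22);  /* 0b00100010: { f0 f8 x x | f4 f12 x x } */
    /* AVX2 lane-crossing shuffle: pick dwords 0,4,1,5 into the low lane */
    __m256i idx = _mm256_setr_epi32(0, 4, 1, 5, 2, 3, 6, 7);
    return _mm256_castps256_ps128(_mm256_permutevar8x32_ps(b, idx));
}

/* AVX1-only fallback: extract the high lane and interleave in-lane.
 * unpcklps happens to produce exactly the { f0 f4 f8 f12 } order here. */
static inline __m128 load_stride4_avx1(const float *p)
{
    __m256 b  = _mm256_blend_ps(_mm256_loadu_ps(p), _mm256_loadu_ps(p + 7), 0x22);
    __m128 lo = _mm256_castps256_ps128(b);    /* { f0 f8 x x }  */
    __m128 hi = _mm256_extractf128_ps(b, 1);  /* { f4 f12 x x } */
    return _mm_unpacklo_ps(lo, hi);           /* { f0 f4 f8 f12 } */
}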

answered by Peter Cordes

If you have AVX2 available, you could use the VGATHERDPS instruction to achieve your goal, as explained in this SO answer. In your case you would just have to initialize the index vector to 0,4,8,12: the *4 scale in the gather addressing mode turns those element indices into the byte offsets 0,16,32,48 of your strided floats.

.data
  .align 16
  ddIndices dd 0,4,8,12 ; element indices; the *4 scale gives byte offsets 0,16,32,48
  dpValues  REAL4 ...   ; replace 'dpValues' with your value array
.code
  lea        rsi, dpValues
  vmovdqa    xmm7, ddIndices

.loop:
  vpcmpeqw   xmm1, xmm1, xmm1        ; set mask to all-ones (the gather clears it)
  vpxor      xmm0, xmm0, xmm0        ; break dependency on previous gather
  vgatherdps xmm0, [rsi+xmm7*4], xmm1
  ; do something with gather result in xmm0

  add        rsi, 64
  cmp        rsi, end_pointer
  jb         .loop                   ; do another gather with the same indices, base += 64 (the next four rows)

XMM1 is the condition mask, which selects which elements are loaded.

Be aware that this instruction is not that fast on Haswell, but the implementation is faster on Broadwell, and faster again on Skylake.

Even so, using a gather instruction for small-stride loads is probably only a win with 8-element ymm vectors on Skylake. According to Intel's optimization manual (11.16.4 Considerations for Gather Instructions), Broadwell hardware-gather with 4-element vectors has a best-case throughput of 1.56 cycles per element when the data is hot in L1D cache.
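
With intrinsics the same gather is a one-liner; a sketch (the wrapper name is mine, and `_mm_i32gather_ps` is the unmasked form, so the mask handling from the asm above is left to the compiler):

#include <immintrin.h>

/* Sketch: gather { p[0], p[4], p[8], p[12] } in one instruction (AVX2). */
static inline __m128 gather_stride4(const float *p)
{
    __m128i idx = _mm_setr_epi32(0, 4, 8, 12);  /* element indices            */
    return _mm_i32gather_ps(p, idx, 4);         /* *4 scale: bytes 0,16,32,48 */
}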


On pre-AVX2 architectures there is no way (known to me) to do this without loading all the values separately, like this (using SSE4.1 `insertps` or `pinsrd`).

lea      esi, dpValues
movss    xmm0, [esi]          ; breaks dependency on old value of xmm0
insertps xmm0, [esi+4], 1<<4  ; dst element index in bits 5:4 of the imm8
insertps xmm0, [esi+8], 2<<4
insertps xmm0, [esi+12], 3<<4

For integer data, the last instruction would be `pinsrd xmm0, [esi+12], 3`.
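
As an intrinsics sketch of the same sequence (the wrapper name is mine; compilers will normally fold each `_mm_load_ss` into an `insertps` memory operand, reproducing the asm above):

#include <smmintrin.h>  /* SSE4.1 */

/* Sketch: movss + three insertps, as in the asm above. */
static inline __m128 load_stride4_sse41(const float *p)
{
    __m128 v = _mm_load_ss(p);                         /* { f0 0 0 0 }, breaks old dep */
    v = _mm_insert_ps(v, _mm_load_ss(p + 4),  1 << 4); /* dest slot in imm8 bits 5:4   */
    v = _mm_insert_ps(v, _mm_load_ss(p + 8),  2 << 4);
    v = _mm_insert_ps(v, _mm_load_ss(p + 12), 3 << 4);
    return v;
}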

Without SSE4.1, shuffle `movss` results together with `unpcklps` / `unpcklpd`.
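
For example, a baseline-SSE sketch (wrapper name mine; only SSE1 is needed):

#include <xmmintrin.h>  /* SSE1 is enough */

/* Sketch: interleave four movss results with two levels of unpcklps. */
static inline __m128 load_stride4_sse1(const float *p)
{
    __m128 a = _mm_load_ss(p);         /* { f0  0 0 0 } */
    __m128 b = _mm_load_ss(p + 4);     /* { f4  0 0 0 } */
    __m128 c = _mm_load_ss(p + 8);     /* { f8  0 0 0 } */
    __m128 d = _mm_load_ss(p + 12);    /* { f12 0 0 0 } */
    __m128 ac = _mm_unpacklo_ps(a, c); /* { f0 f8  0 0 } */
    __m128 bd = _mm_unpacklo_ps(b, d); /* { f4 f12 0 0 } */
    return _mm_unpacklo_ps(ac, bd);    /* { f0 f4 f8 f12 } */
}

Pairing f0 with f4 and f8 with f12 instead would make the final combine a `movlhps` / `unpcklpd`; either way it's three shuffles after the four loads.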

answered by zx485 (edited by Peter Cordes)
Comments:
  • Start with `movss` instead of `pinsrd`. It breaks the false dependency on the previous value of `xmm0` by zeroing the upper part, and is only a single uop. Never use insert/extract instructions with an index of `0` unless it's an insert where you *want* to preserve the upper part. (For integer, use `movd` / `movq`). Also you can use `insertps` (also SSE4.1, and with a more complicated imm8 where the dest position isn't the low 2 bits.) This might shorten the dep chain by a bypass delay between the last `pinsrd` and a `mulps` or something. – Peter Cordes Mar 18 '16 at 06:37
  • Also, if your code doesn't have to be PIC, and you're not using multiple instructions with effective addresses (where `[reg+disp8]` saves code size), don't bother with the `lea`: just use `[dpValues + xmm7*4]`. (RIP-relative doesn't work in gathers.) – Peter Cordes Mar 18 '16 at 06:48