What is an efficient way to load an x64 ymm register with 4 separated doubles?

Question

What is the most efficient way to load an x64 ymm register with

4 doubles evenly spaced, i.e. taken from a contiguous set of doubles

0  1  2  3  4  5  6  7  8  9 10 .. 100
where I want to load, for example, 0, 10, 20, 30

4 doubles at any position

i.e. I want to load, for example, 1, 6, 22, 43

Can we assume AVX2 is available, or do you need an AVX-only solution? — Paul R, Feb 12 '16 at 08:27
My apologies, I should have stated that it would be nice to have an AVX solution too. — David Price, Feb 12 '16 at 14:04

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

The simplest approach is VGATHERQPD, an AVX2 instruction available on Haswell and later.

VGATHERQPD ymm1, [rsi+ymm7*8], ymm2

Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.

which achieves this with one instruction. Here ymm2 is the mask register: the highest bit of each qword element indicates whether the corresponding value is loaded into ymm1 or left unchanged. ymm7 contains the indices of the elements, which are multiplied by the scale factor (8, the size of a double).

So applied to your examples, it could look like this in MASM syntax:

4 doubles evenly spaced i.e. a contiguous set of doubles

0 1 2 3 4 5 6 7 8 9 10 .. 100 --- and I want to load, for example, 0, 10, 20, 30

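.data
  align 16
  qqIndices dq 0,10,20,30
  dpValues  REAL8 0,1,2,3, ... 100
.code
  lea rsi, dpValues
  vmovdqu ymm7, ymmword ptr qqIndices     ; load the four qword indices (unaligned load: data is only 16-byte aligned)
  vpcmpeqw ymm1, ymm1, ymm1               ; set the mask to all ones
  vgatherqpd ymm0, [rsi+ymm7*8], ymm1     ; gather dpValues[index] for each index; the mask in ymm1 is cleared afterwards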

Now ymm0 contains the four doubles 0, 10, 20, 30, though I haven't tested this yet. Note that this is not necessarily the fastest choice in every scenario: the values are all gathered separately, so each value needs its own memory access; see "How are the gather instructions in AVX2 implemented".
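For readers working in C rather than assembly, the same gather can be sketched with the AVX2 intrinsic `_mm256_i64gather_pd`, which compilers typically lower to `vgatherqpd`. This is only a minimal sketch: the function names `gather4_avx2` and `gather4` are made up for illustration, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions assumed to be available.

```c
#include <immintrin.h>
#include <stddef.h>

/* AVX2 path: a single gather with all four lanes enabled.
   The target attribute (GCC/Clang extension, an assumption here) lets this
   one function use AVX2 without compiling the whole file with -mavx2. */
__attribute__((target("avx2")))
static void gather4_avx2(const double *base, const long long idx[4], double out[4])
{
    /* Load the four 64-bit indices (no alignment requirement). */
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* Scale 8 = sizeof(double), matching [base + index*8] in the asm. */
    __m256d v = _mm256_i64gather_pd(base, vidx, 8);
    _mm256_storeu_pd(out, v);
}

/* Hypothetical wrapper: uses the gather when the CPU reports AVX2,
   otherwise falls back to four plain scalar loads. */
static void gather4(const double *base, const long long idx[4], double out[4])
{
    if (__builtin_cpu_supports("avx2")) {
        gather4_avx2(base, idx, out);
    } else {
        for (size_t i = 0; i < 4; ++i)
            out[i] = base[idx[i]];
    }
}
```

Calling `gather4(data, idx, out)` with `idx = {0, 10, 20, 30}` covers the evenly spaced case, and `idx = {1, 6, 22, 43}` the arbitrary one; the indices are element indices, not byte offsets.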

So, according to Mysticial's comment

I recently had to do something that required a true gather-load. (i.e. data[index[i]]). On Haswell, 4 index loads + 2x movsd + 2x movhpd + vinsertf128 is still significantly faster than a ymm load + vgatherqpd. So even in the best case scenario, 4-way gather still loses. I haven't tried 8-way gather though.

the fastest way would be to use that approach.

So "efficient" in terms of instruction count would be VGATHER, while "efficient" in terms of execution time would be the manual load sequence from the comment (so far; let's see how future architectures perform).
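The sequence named in that comment (2x movsd + 2x movhpd + vinsertf128) needs only AVX, so it also covers the AVX-only case asked about in the question's comments. A rough C sketch with intrinsics follows; the function names `load4_avx` and `load4` are hypothetical, the compiler is free to pick different instructions than the ones named in the comments, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions assumed to be available.

```c
#include <immintrin.h>
#include <stddef.h>

/* AVX-only path: build each 128-bit half from two scalar loads,
   then combine the halves into one 256-bit register. */
__attribute__((target("avx")))
static void load4_avx(const double *base, const size_t idx[4], double out[4])
{
    __m128d lo = _mm_load_sd(base + idx[0]);   /* movsd:  element 0, upper half zeroed */
    lo = _mm_loadh_pd(lo, base + idx[1]);      /* movhpd: element 1 into the upper half */
    __m128d hi = _mm_load_sd(base + idx[2]);   /* movsd:  element 2 */
    hi = _mm_loadh_pd(hi, base + idx[3]);      /* movhpd: element 3 */
    /* vinsertf128: lo becomes lanes 0-1, hi becomes lanes 2-3. */
    __m256d v = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
    _mm256_storeu_pd(out, v);
}

/* Hypothetical wrapper: falls back to scalar copies if AVX is not reported. */
static void load4(const double *base, const size_t idx[4], double out[4])
{
    if (__builtin_cpu_supports("avx")) {
        load4_avx(base, idx, out);
    } else {
        for (size_t i = 0; i < 4; ++i)
            out[i] = base[idx[i]];
    }
}
```

For the evenly spaced case the indices are known at compile time, so the four addresses can be folded into the load instructions as fixed displacements and no index loads are needed at all.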

EDIT: According to the comments, the VGATHER instructions get faster on Broadwell and Skylake.

`VPGATHERDD` (8-way gather) is slower than a sequence of `movd` / `pinsrd` on Haswell. Broadwell has faster gathers, and Skylake even faster. I'm not sure where the tipping point is. Also, don't load a vector of all-ones. Use `vpcmpeqw ymm1, ymm1` to generate the constant. — Peter Cordes, Feb 13 '16 at 20:55

ErmIg · Answer 2 · 2016-02-12T08:38:49.787

I think you should look at a gather operation such as VGATHERQPD.

The instruction conditionally loads up to 2 or 4 double-precision floating-point values from memory addresses specified by the memory operand (the second operand) and using qword indices. The memory operand uses the VSIB form of the SIB byte to specify a general purpose register operand as the common base, a vector register for an array of indices relative to the base and a constant scale factor.

Note that this requires AVX2, so it is not applicable to Sandy Bridge/Ivy Bridge, which have AVX but not AVX2.
