The simplest approach is VGATHERQPD, an AVX2 instruction available on Haswell and up, which can achieve this with a single instruction:
VGATHERQPD ymm1, [rsi+ymm7*8], ymm2
Using qword indices specified in vm64y, gather double-precision FP values from memory conditioned on mask specified by ymm2. Conditionally gathered elements are merged into ymm1.
Here ymm2 is the mask register; the most significant bit of each qword element decides whether the corresponding value is copied to ymm1 or left unchanged. ymm7 contains the indices of the elements, which the addressing mode multiplies by the scale factor (8, the size of a double).
So applied to your examples, it could look like this in MASM syntax:
4 doubles evenly spaced, i.e. a contiguous set of doubles
0 1 2 3 4 5 6 7 8 9 10 .. 100 --- and I want to load, for example, 0, 10, 20, 30
.data
align 16
qqIndices dq 0,10,20,30
dpValues REAL8 0,1,2,3, ... 100
.code
lea        rsi, dpValues
vmovdqu    ymm7, ymmword ptr qqIndices   ; load the four qword indices
vpcmpeqw   ymm1, ymm1, ymm1              ; set mask to all ones (only the element sign bits matter)
vgatherqpd ymm0, [rsi+ymm7*8], ymm1      ; gather dpValues[0], [10], [20], [30]; clears ymm1
Now ymm0 contains the four doubles 0.0, 10.0, 20.0 and 30.0.
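If you want to check the result, a minimal sketch (with a hypothetical 32-byte buffer dpResult added to the .data section; not part of the original code) is to store the register back to memory:
dpResult REAL8 4 dup(?)                  ; in .data
...
vmovupd ymmword ptr dpResult, ymm0       ; dpResult now holds 0.0, 10.0, 20.0, 30.0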
Though, I haven't tested this yet. Another thing to mention: this is not necessarily the fastest choice in every scenario. The values are all gathered separately, which means each value needs its own memory access; see How are the gather instructions in AVX2 implemented?
So according to Mysticial's comment:
I recently had to do something that required a true gather-load (i.e. data[index[i]]). On Haswell, 4 index loads + 2x movsd + 2x movhpd + vinsertf128 is still significantly faster than a ymm load + vgatherqpd. So even in the best case scenario, 4-way gather still loses. I haven't tried 8-way gather though.
the fastest way (at least on Haswell) would be that manual approach, sketched below.
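For illustration, a minimal MASM-style sketch of that sequence, assuming the same dpValues/qqIndices layout as above and using rax, rbx, rcx and rdx as scratch registers (these register choices are mine, not from the comment), could look like this:
mov         rax, qword ptr [qqIndices]            ; four index loads
mov         rbx, qword ptr [qqIndices+8]
mov         rcx, qword ptr [qqIndices+16]
mov         rdx, qword ptr [qqIndices+24]
vmovsd      xmm0, qword ptr [rsi+rax*8]           ; element 0 -> low qword of xmm0
vmovhpd     xmm0, xmm0, qword ptr [rsi+rbx*8]     ; element 1 -> high qword of xmm0
vmovsd      xmm2, qword ptr [rsi+rcx*8]           ; element 2 -> low qword of xmm2
vmovhpd     xmm2, xmm2, qword ptr [rsi+rdx*8]     ; element 3 -> high qword of xmm2
vinsertf128 ymm0, ymm0, xmm2, 1                   ; combine the two 128-bit halves into ymm0
This still performs one memory access per element (plus the index loads), but it avoids the gather instruction itself, which is what Mysticial measured to be slower on Haswell.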
So "efficient" in an OpCode way would be using VGATHER
and "efficient" relating to execution time would be the last one (so far, let's see how future architectures will perform).
EDIT: according to the comments, the VGATHER instructions got faster on Broadwell and Skylake.