If you have AVX2 available, you can use the VGATHERDPS instruction to achieve your goal, as explained in this SO answer. In your case you would just initialize the index vector to 0,1,2,3 (and let the gather addressing mode scale that up to byte offsets 0,4,8,12).
.data
.align 16
ddIndices dd 0,1,2,3
dpValues REAL4 ... ; replace 'dpValues' with your value array
.code
lea rsi, dpValues
vmovdqa xmm7, ddIndices
.loop:
vpcmpeqw xmm1, xmm1, xmm1 ; set to all ones: load all four elements
vpxor xmm0, xmm0, xmm0 ; break dependency on previous gather
vgatherdps xmm0, [rsi+xmm7*4], xmm1 ; gather 4 floats: dword indices in xmm7, scaled by 4
; do something with gather result in xmm0
add rsi, 16
cmp rsi, end_pointer
jb .loop ; do another gather with same indices, base+=16
XMM1 is the condition mask which selects which elements are loaded. The gather instruction clears the mask as each element is loaded, which is why it has to be re-initialized on every iteration of the loop.
Be aware that this instruction is not that fast on Haswell, but the implementation is faster on Broadwell and faster again on Skylake.
Even so, using a gather instruction for small-stride loads is probably only a win with 8-element ymm vectors on Skylake. According to Intel's optimization manual (11.16.4 Considerations for Gather Instructions), Broadwell hardware-gather with 4-element vectors has a best-case throughput of 1.56 cycles per element when the data is hot in L1D cache.
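As a rough illustration, the 8-element ymm version of the loop above could look like this (a sketch reusing the hypothetical dpValues array and end_pointer from the first example):
.data
.align 32
ddIndices8 dd 0,1,2,3,4,5,6,7 ; eight dword indices for one ymm gather
.code
lea rsi, dpValues
vmovdqa ymm7, ddIndices8
.loop8:
vpcmpeqd ymm1, ymm1, ymm1 ; set to all ones: load all eight elements
vpxor ymm0, ymm0, ymm0 ; break dependency on previous gather
vgatherdps ymm0, [rsi+ymm7*4], ymm1 ; gather 8 floats, indices scaled by 4
; do something with gather result in ymm0
add rsi, 32 ; advance by 8 floats
cmp rsi, end_pointer
jb .loop8 ; do another gather with same indices, base+=32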
On pre-AVX2 architectures there is no way (known to me) to do this without loading all values separately, like this (using SSE4.1 insertps or pinsrd):
lea rsi, dpValues
movss xmm0, [rsi] ; breaks dependency on old value of xmm0
insertps xmm0, [rsi+4], 1<<4 ; dst element index in bits 5:4 of the imm8
insertps xmm0, [rsi+8], 2<<4
insertps xmm0, [rsi+12], 3<<4
For integer data, the last instruction would be pinsrd xmm0, [rsi+12], 3.
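A full integer version might look like this (a sketch, again assuming the dpValues base pointer):
lea rsi, dpValues
movd xmm0, [rsi] ; zeroes the upper elements, breaking the dependency
pinsrd xmm0, [rsi+4], 1
pinsrd xmm0, [rsi+8], 2
pinsrd xmm0, [rsi+12], 3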
Without SSE4.1, shuffle movss results together with unpcklps / unpcklpd.
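A minimal sketch of that fallback (assuming the same hypothetical dpValues array; element 0 is rightmost in the comments):
lea rsi, dpValues
movss xmm0, [rsi] ; xmm0 = [ -, -, -, a ]
movss xmm1, [rsi+4] ; xmm1 = [ -, -, -, b ]
movss xmm2, [rsi+8] ; xmm2 = [ -, -, -, c ]
movss xmm3, [rsi+12] ; xmm3 = [ -, -, -, d ]
unpcklps xmm0, xmm1 ; xmm0 = [ -, -, b, a ]
unpcklps xmm2, xmm3 ; xmm2 = [ -, -, d, c ]
unpcklpd xmm0, xmm2 ; xmm0 = [ d, c, b, a ]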