I'm struggling to optimize this loop using AVX (excerpt only, NASM syntax):
.repete:
vmulpd ymm4, ymm1, ymm2
vhaddpd ymm5, ymm4, ymm4
vextractf128 xmm6, ymm5, 1
vaddsd xmm5, xmm5, xmm6
vcvtss2sd xmm7, [MSI + MCX * 4]
vmulsd xmm3, xmm7, xmm0
vaddsd xmm5, xmm5, xmm3
; Store result.
vcvtsd2ss xmm6, xmm5, xmm5
vmovss [MDI + MCX * 4], xmm6
vunpcklpd xmm7, xmm5, xmm7 ; (!!!!) marked instruction
MSSE_SHUFFLEAVX(ymm2, ymm2, ymm7, 2, 0)
inc ecx
cmp ecx, edx
jl .repete
When the instruction marked (!!!!) is present, the loop is about 3x slower. If I replace it with "vmovapd ymm7, ymm5" (just as a test), the result is the same, so the dependency on xmm5 appears to be the problem. I tried to work around it by moving the xmm5 calculations to the beginning of the loop, but with no luck.
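To make the suspected dependency concrete, here is a rough scalar C sketch of what the loop is meant to compute. It rests on assumptions not visible in the excerpt: that ymm1 holds four coefficients, that xmm0 holds a gain applied to the current input, and that the MSSE_SHUFFLEAVX macro moves the old low lane of ymm2 into the high lane and puts the new (result, input) pair into the low lane. The names `history_t` and `filter_scalar` are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical scalar model of the AVX loop; every name here is illustrative. */
typedef struct {
    double y1, x1;  /* assumed low lane of ymm2: previous result and input   */
    double y2, x2;  /* assumed high lane of ymm2: the pair from one step earlier */
} history_t;

void filter_scalar(const float *src, float *dst, size_t n,
                   const double coef[4], double gain, history_t *h)
{
    for (size_t i = 0; i < n; ++i) {
        double x  = (double)src[i];                       /* vcvtss2sd            */
        double lo = coef[0] * h->y1 + coef[1] * h->x1;    /* vmulpd + vhaddpd, low lane  */
        double hi = coef[2] * h->y2 + coef[3] * h->x2;    /* vmulpd + vhaddpd, high lane */
        double y  = (lo + hi) + x * gain;                 /* vextractf128, vaddsd, vmulsd, vaddsd */

        dst[i] = (float)y;                                /* vcvtsd2ss + vmovss   */

        /* Assumed effect of the marked vunpcklpd plus the shuffle macro: */
        h->y2 = h->y1;  h->x2 = h->x1;                    /* old low lane -> high lane */
        h->y1 = y;      h->x1 = x;                        /* new (result, input) pair  */
    }
}
```

If this model is right, each new result feeds the next iteration through `h->y1`, which is the dependency on xmm5 that the marked instruction introduces.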
Any ideas on how to get around it? Or could it be something else? Is there a guide that covers these things? In the end, only vmulpd and vhaddpd really take advantage of AVX here, so maybe it is just not worth it?