AVX assembler loop gets slowed down 3x by vunpcklpd instruction

Question

I'm struggling to optimize this loop using AVX (excerpt only, NASM syntax):

.repete:
vmulpd ymm4, ymm1, ymm2
vhaddpd ymm5, ymm4, ymm4 
vextractf128 xmm6, ymm5, 1
vaddsd xmm5, xmm5, xmm6

vcvtss2sd xmm7, [MSI + MCX * 4]
vmulsd xmm3, xmm7, xmm0

vaddsd xmm5, xmm5, xmm3

; Store result.
vcvtsd2ss xmm6, xmm5, xmm5
vmovss [MDI + MCX * 4], xmm6

vunpcklpd xmm7, xmm5, xmm7 ; (!!!!)
MSSE_SHUFFLEAVX(ymm2, ymm2, ymm7, 2, 0)

inc ecx
cmp ecx, edx
jl .repete

When the instruction marked (!!!!) is present, the loop is about 3x slower. If I replace it with "vmovapd ymm7, ymm5" (just as a test), the same thing happens. So apparently the dependency on xmm5 is the problem. I tried to get around it by moving the xmm5 calculations to the beginning of the loop, but no luck.

Any ideas how to get around it? Or could it be something else? Is there a guidebook about these things? In the end, only vmulpd and vhaddpd really take advantage of AVX, so maybe it is just not worth it?

If you're really asking this question, I have a feeling that you do not know what out-of-order execution is. If you do a dependency analysis of the code with and without that `vunpcklpd`, you should easily see why it makes such a big difference. — Mysticial, Mar 29 '15 at 01:19
I get it, but currently it is slower than MSVC's C++ implementation, which seems pretty bad. Also, without the change to ymm2 based on xmm5 there is really "no change of state", but are CPUs really smart enough to know that nothing has changed? — mrzacek mrzacek, Mar 29 '15 at 09:52
*Is there some guidebook*: Yes http://agner.org/optimize/ explains out-of-order execution and dependencies. Also see other links in https://stackoverflow.com/tags/x86/info. — Peter Cordes, Nov 14 '17 at 18:23
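
The dependency analysis suggested in the comments can be sketched in scalar C. This is only a rough model and rests on an assumption: the MSSE_SHUFFLEAVX macro is not shown in the question, so the sketch assumes it shifts the pair {y, x} built by vunpcklpd into the low half of ymm2 and moves the old low half up. The function names, the coefficient array `c` (standing in for ymm1), `b0` (xmm0) and the state array `s` (ymm2) are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the loop WITH the marked instruction (assumed shuffle semantics).
   The state s[] is rewritten from y every iteration, so iteration i+1
   cannot start its multiply until iteration i has produced y. */
void filter_feedback(const float *src, float *dst, size_t n,
                     const double c[4], double b0, double s[4])
{
    assert(n == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < n; ++i) {
        double x = (double)src[i];               /* vcvtss2sd */
        /* vmulpd + vhaddpd + vextractf128 + vaddsd: dot product of c and s */
        double y = (c[0] * s[0] + c[1] * s[1]) + (c[2] * s[2] + c[3] * s[3]);
        y += x * b0;                             /* vmulsd + vaddsd */
        dst[i] = (float)y;                       /* vcvtsd2ss + vmovss */
        /* vunpcklpd + shuffle (ASSUMED): new state = {y, x, old s[0], old s[1]} */
        s[3] = s[1];
        s[2] = s[0];
        s[1] = x;
        s[0] = y;
    }
}

/* Model of the loop WITHOUT the marked instruction: s[] is only read.
   Iteration i+1 needs nothing from iteration i, so iterations can overlap. */
void filter_no_feedback(const float *src, float *dst, size_t n,
                        const double c[4], double b0, const double s[4])
{
    assert(n == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < n; ++i) {
        double x = (double)src[i];
        double y = (c[0] * s[0] + c[1] * s[1]) + (c[2] * s[2] + c[3] * s[3]);
        y += x * b0;
        dst[i] = (float)y;
    }
}
```

In the first version every instruction from vmulpd through the shuffle sits on one chain that is carried into the next iteration, so the time per iteration is roughly the sum of those latencies. In the second version the CPU does not need to "know" that nothing has changed: it still executes every instruction, but because ymm2 is never written inside the loop, the inputs of the next iteration are already available and out-of-order execution can overlap several iterations.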
