If I write this code:
void loop1(int N, double* R, double* A, double* B) {
    for (int i = 0; i < N; i += 1) {
        R[i] = A[i] + B[i];
    }
}
Clang (-O3) generates the following x64 ASM as part of an unrolled version of the loop (Compiler Explorer):
.LBB0_14:
movupd xmm0, xmmword ptr [rdx + 8*rax]
movupd xmm1, xmmword ptr [rdx + 8*rax + 16]
movupd xmm2, xmmword ptr [rcx + 8*rax]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rcx + 8*rax + 16]
addpd xmm0, xmm1
movupd xmmword ptr [rsi + 8*rax], xmm2
movupd xmmword ptr [rsi + 8*rax + 16], xmm0
rdx and rcx are holding my input pointers (A/B), rsi is the output (R), and rax is an offset counter. So it's loading two pairs of inputs/outputs at a time, adding them using SIMD instructions, and writing them to the output - so far so good.
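For reference, one iteration of that unrolled body corresponds to something like the following SSE2 intrinsics (my own reconstruction, not anything Clang actually outputs; the function name and parameters are made up):

#include <emmintrin.h>  // SSE2 intrinsics

// Rough C equivalent of one iteration of the unrolled loop1 body above:
// two unaligned 16-byte loads from each input, two packed adds, two stores.
static void loop1_unrolled_body(double* R, const double* A, const double* B, long i) {
    __m128d a0 = _mm_loadu_pd(&A[i]);                       // movupd xmm0, [rdx + 8*rax]
    __m128d a1 = _mm_loadu_pd(&A[i + 2]);                   // movupd xmm1, [rdx + 8*rax + 16]
    __m128d s0 = _mm_add_pd(_mm_loadu_pd(&B[i]), a0);       // movupd + addpd
    __m128d s1 = _mm_add_pd(_mm_loadu_pd(&B[i + 2]), a1);   // movupd + addpd
    _mm_storeu_pd(&R[i], s0);                               // movupd [rsi + 8*rax], xmm2
    _mm_storeu_pd(&R[i + 2], s1);                           // movupd [rsi + 8*rax + 16], xmm0
}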
If instead I write the following:
void loop2(int N, double* R, double* A, double* B) {
    for (int i = 0; i < N; i += 2) {
        R[i] = A[i] + B[i];
        R[i + 1] = A[i + 1] + B[i + 1];
    }
}
LLVM generates the following (Compiler Explorer):
.LBB0_13:
movupd xmm0, xmmword ptr [rdx + 8*rdi]
movupd xmm1, xmmword ptr [rdx + 8*rdi + 16]
movupd xmm2, xmmword ptr [rcx + 8*rdi]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rcx + 8*rdi + 16]
addpd xmm0, xmm1

movapd xmm1, xmm2
unpckhpd xmm1, xmm0 # xmm1 = xmm1[1],xmm0[1]
unpcklpd xmm2, xmm0 # xmm2 = xmm2[0],xmm0[0]
movapd xmm0, xmm2
unpcklpd xmm0, xmm1 # xmm0 = xmm0[0],xmm1[0]
unpckhpd xmm2, xmm1 # xmm2 = xmm2[1],xmm1[1]

movupd xmmword ptr [rsi + 8*rdi + 16], xmm2
movupd xmmword ptr [rsi + 8*rdi], xmm0
Spacing added for clarity, because it's that middle section with the unpckhpd etc. that's confusing me. As far as I can see, the overall effect of those 6 instructions is just to swap xmm0 and xmm2, which seems like a waste of time.
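To spell out what I mean, here's my own lane-by-lane trace of that middle section written with the matching intrinsics (variable names are mine; I'm assuming the two packed sums going in are {r0, r1} in xmm2 and {r2, r3} in xmm0):

#include <emmintrin.h>  // SSE2 intrinsics

// Lane-by-lane trace (my reconstruction) of the six shuffle/move instructions,
// starting from xmm2 = {r0, r1} and xmm0 = {r2, r3} after the two addpd.
static void store_shuffled(__m128d v2 /* {r0, r1} */, __m128d v0 /* {r2, r3} */,
                           double* R, long i) {
    __m128d hi   = _mm_unpackhi_pd(v2, v0);   // movapd + unpckhpd: {r1, r3}
    __m128d lo   = _mm_unpacklo_pd(v2, v0);   // unpcklpd:          {r0, r2}
    __m128d out0 = _mm_unpacklo_pd(lo, hi);   // movapd + unpcklpd: {r0, r1} == original v2
    __m128d out2 = _mm_unpackhi_pd(lo, hi);   // unpckhpd:          {r2, r3} == original v0
    _mm_storeu_pd(&R[i + 2], out2);           // movupd [rsi + 8*rdi + 16], xmm2
    _mm_storeu_pd(&R[i], out0);               // movupd [rsi + 8*rdi],      xmm0
}
// Net effect: out0/out2 are just v2/v0 with their registers swapped before the stores.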
Any idea why it's doing this? And is there a way to stop it? :p
EDIT: I edited the ASM for loop2() to remove all similar blocks (and swap around the registers in the subsequent writes), and it appeared to run correctly and at the same speed as loop1() (~40% faster).