
If I write this code:

void loop1(int N, double* R, double* A, double* B) {
    for (int i = 0; i < N; i += 1) {
        R[i] = A[i] + B[i];
    }
}

Clang (-O3) generates the following x64 ASM as part of an unrolled version of the loop (Compiler Explorer):

.LBB0_14:
    movupd  xmm0, xmmword ptr [rdx + 8*rax]
    movupd  xmm1, xmmword ptr [rdx + 8*rax + 16]
    movupd  xmm2, xmmword ptr [rcx + 8*rax]
    addpd   xmm2, xmm0
    movupd  xmm0, xmmword ptr [rcx + 8*rax + 16]
    addpd   xmm0, xmm1
    movupd  xmmword ptr [rsi + 8*rax], xmm2
    movupd  xmmword ptr [rsi + 8*rax + 16], xmm0

rdx and rcx are holding my input pointers (A/B), rsi is the output (R), and rax is an offset counter. So each iteration loads two 16-byte vectors (two pairs of doubles) from each input, adds them with SIMD instructions, and writes the results to the output - so far so good.
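
For clarity, here's my reading of that unrolled body as an SSE2 intrinsics sketch (my reconstruction, not anything the compiler emitted - the asm-to-intrinsic mapping is in the comments):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One unrolled iteration of loop1: two 2-double vectors are loaded
   from each input, added pairwise, and stored to the output. */
void loop1_body(double *R, const double *A, const double *B, long i) {
    __m128d a0 = _mm_loadu_pd(A + i);     /* movupd xmm0, [rdx + 8*rax] */
    __m128d a1 = _mm_loadu_pd(A + i + 2); /* movupd xmm1, [rdx + 8*rax + 16] */
    __m128d b0 = _mm_loadu_pd(B + i);     /* movupd xmm2, [rcx + 8*rax] */
    __m128d b1 = _mm_loadu_pd(B + i + 2); /* movupd xmm0, [rcx + 8*rax + 16] */
    _mm_storeu_pd(R + i,     _mm_add_pd(a0, b0)); /* addpd + movupd store */
    _mm_storeu_pd(R + i + 2, _mm_add_pd(a1, b1)); /* addpd + movupd store */
}
```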

If instead I write the following:

void loop2(int N, double* R, double* A, double* B) {
    for (int i = 0; i < N; i += 2) {
        R[i] = A[i] + B[i];
        R[i + 1] = A[i + 1] + B[i + 1];
    }
}

LLVM generates the following (Compiler Explorer):

.LBB0_13:
    movupd  xmm0, xmmword ptr [rdx + 8*rdi]
    movupd  xmm1, xmmword ptr [rdx + 8*rdi + 16]
    movupd  xmm2, xmmword ptr [rcx + 8*rdi]
    addpd   xmm2, xmm0
    movupd  xmm0, xmmword ptr [rcx + 8*rdi + 16]
    addpd   xmm0, xmm1

    movapd  xmm1, xmm2
    unpckhpd        xmm1, xmm0      # xmm1 = xmm1[1],xmm0[1]
    unpcklpd        xmm2, xmm0      # xmm2 = xmm2[0],xmm0[0]
    movapd  xmm0, xmm2
    unpcklpd        xmm0, xmm1      # xmm0 = xmm0[0],xmm1[0]
    unpckhpd        xmm2, xmm1      # xmm2 = xmm2[1],xmm1[1]

    movupd  xmmword ptr [rsi + 8*rdi + 16], xmm2
    movupd  xmmword ptr [rsi + 8*rdi], xmm0

Spacing added for clarity, because it's that middle section with the unpckhpd etc. that's confusing me. As far as I can see, the overall effect of those 6 instructions is just to swap xmm0 and xmm2, which seems like a waste of time.
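
To double-check that reading, here's the middle section translated into intrinsics (my translation of the asm, with the mapping in comments - note that in Intel syntax unpcklpd keeps the low elements and unpckhpd the high ones):

```c
#include <emmintrin.h>

/* The six-instruction shuffle block from loop2's asm.
   _mm_unpacklo_pd(a, b) = {a[0], b[0]}; _mm_unpackhi_pd(a, b) = {a[1], b[1]}. */
void shuffle_block(__m128d *x0, __m128d *x2) {
    __m128d t0 = *x0, t2 = *x2;
    __m128d x1 = _mm_unpackhi_pd(t2, t0); /* movapd xmm1, xmm2 ; unpckhpd xmm1, xmm0 */
    t2 = _mm_unpacklo_pd(t2, t0);         /* unpcklpd xmm2, xmm0 */
    t0 = _mm_unpacklo_pd(t2, x1);         /* movapd xmm0, xmm2 ; unpcklpd xmm0, xmm1 */
    t2 = _mm_unpackhi_pd(t2, x1);         /* unpckhpd xmm2, xmm1 */
    *x0 = t0;                             /* net effect: x0 holds the old x2 ... */
    *x2 = t2;                             /* ... and x2 holds the old x0 */
}
```

Tracing it through element by element confirms it: afterwards xmm0 holds the old xmm2 and vice versa - a pure register swap done with four shuffles and two copies.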

Any idea why it's doing this? And is there a way to stop it? :p


EDIT: I edited the ASM for loop2() to remove all similar blocks (and swapped the registers in the subsequent writes accordingly), and it appeared to run correctly, at the same speed as loop1() (~40% faster than the unmodified loop2()).
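
For anyone experimenting with this, here's a restrict-qualified variant of loop2() (per the comments, this lets the compiler drop the runtime overlap check and the scalar fallback loop, though it does not remove the redundant shuffles):

```c
/* loop2() with restrict-qualified pointers: the compiler may assume
   R does not overlap A or B, so no runtime aliasing check is needed. */
void loop2_restrict(int N, double *restrict R,
                    const double *restrict A, const double *restrict B) {
    for (int i = 0; i < N; i += 2) {
        R[i] = A[i] + B[i];
        R[i + 1] = A[i + 1] + B[i + 1];
    }
}
```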

cloudfeet
  • You could conceivably generate the assembly, compile it with an assembler, and see if removing these instructions has any effect? – Micrified Feb 08 '19 at 17:16
  • Yup, looks like a missed optimization. Apparently manual unrolling confused / interfered with the automatic unrolling that happens as part of auto-vectorization and you ended up with a mess. Report it at https://bugs.llvm.org/enter_bug.cgi?product=new-bugs. It's a nice clean very minimal MCVE, and the bug happens even with `__restrict` on all 3 pointers. (But that does get LLVM not to check for overlap at runtime, or emit a scalar version of the loop for that case.) – Peter Cordes Feb 08 '19 at 17:40
  • @PeterCordes: As an aside, is LLVM/Clang assuming `R` does not overlap `A` or `B`? Why is it allowed to do that without a `restrict` qualifier? – Eric Postpischil Feb 08 '19 at 20:01
  • @EricPostpischil part of the stuff at `.LBB0_8` checks for that, those `lea` compute the end addresses and then the ranges are compared for overlap – harold Feb 09 '19 at 01:48
