
I write a lot of vectorized loops, so one common idiom is:

volatile int dummy[1 << 10];
for (int64_t i = 0; i + 16 <= argc; i += 16)   // process all elements with whole vectors
{
  int x = dummy[i];   // volatile load stands in for the real vector work
}
// handle remainder (hopefully with SIMD too)
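
For concreteness, here is a sketch of the full pattern (the function name sum and the scalar inner loop are hypothetical stand-ins for a real SIMD body):

#include <stdint.h>

int64_t sum(const int *a, int64_t n)
{
    int64_t total = 0;
    int64_t i;
    for (i = 0; i + 16 <= n; i += 16)   /* whole 16-element chunks */
        for (int j = 0; j < 16; ++j)    /* stands in for one SIMD operation */
            total += a[i + j];
    for (; i < n; ++i)                  /* remainder, fewer than 16 elements */
        total += a[i];
    return total;
}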

But the resulting machine code has one more instruction than I would like (using gcc 4.9):

.L3:
        leaq    -16(%rax), %rdx            # rdx = rax - 16: the extra instruction
        addq    $16, %rax                  # advance the counter
        cmpq    %rcx, %rax                 # test against the loop bound
        movl    -120(%rsp,%rdx,4), %edx    # volatile load of dummy[i]
        jbe     .L3

If I change the code to for (int64_t i = 0; i <= argc - 16; i += 16), then the "extra" instruction is gone:

.L2:
        movl    -120(%rsp,%rax,4), %ecx    # volatile load of dummy[i]
        addq    $16, %rax                  # advance the counter
        cmpq    %rdx, %rax                 # test against the loop bound
        jbe     .L2

But why the difference? I was thinking maybe it was due to loop invariants, but only vaguely. Then I noticed that in the 5-instruction case, the increment is done before the load, so the old index has to be recovered with an extra copy (here a leaq rather than a mov), because x86's destructive 2-operand instructions would otherwise clobber it. So another explanation could be that it's trading one extra instruction for more instruction-level parallelism.

Although it seems there would hardly be any performance difference, can someone (ideally with knowledge of compiler transformations) explain this mystery?

Ideally I would like to keep the i + 16 <= size form, since it has the more intuitive meaning (the last element of the vector doesn't go out of bounds).
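
One rewrite that sidesteps the issue, sketched here under the assumption that size is a 32-bit int like argc (last is a name introduced only for illustration): do the subtraction once in 64-bit arithmetic, where it cannot overflow.

int64_t last = (int64_t)size - 16;   /* cannot overflow for any 32-bit size */
for (int64_t i = 0; i <= last; i += 16)
{
  int x = dummy[i];   /* placeholder body, as above */
}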

Yale Zhang
  • Only ever ask questions about optimized code; non-optimized code is literal to the program and not indicative of speed at all. Nothing will remain of what you have now when you turn on the optimizer; clearly you cannot make that faster. – Hans Passant May 09 '14 at 22:59
  • But this is optimized code (-O3), save for the volatile load? And it was to illustrate the problem in a simple way. – Yale Zhang May 10 '14 at 00:30
  • I agree supercat's answer is very precise, but it assumes that transforming i + 16 <= argc to i <= argc - 16 is necessary to achieve the minimal number of instructions. I would like to figure out whether that's actually necessary. – Yale Zhang May 10 '14 at 19:17

1 Answer


If argc were below -2147483632 and i were below 2147483632, the expression i + 16 <= argc would be required to yield an arithmetically-correct result (false, since i + 16 cannot overflow there), while the expression i <= argc - 16 would not, because argc - 16 would overflow. The need to give an arithmetically-correct result in that corner case prevents the compiler from optimizing the former expression to match the latter.
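
To make the corner case concrete, here is a minimal sketch (the values are chosen so that argc - 16 would overflow while i + 16 does not):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int argc = INT_MIN;   /* below -2147483632 */
    int i = 0;            /* below 2147483632  */

    /* Well-defined: i + 16 is 16, so the comparison is false. */
    printf("%d\n", i + 16 <= argc);   /* prints 0 */

    /* argc - 16 would be INT_MIN - 16: signed overflow, i.e. undefined
       behavior, so the compiler may not rewrite the comparison above
       into i <= argc - 16. */
    return 0;
}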

supercat
  • What a terrible gotcha (who would use a negative array size)! I've tried changing the types of both argc and i to uint32_t, and the resulting instruction count is still 5, as expected, since wraparound can still happen. But when I change the types to int64_t and uint64_t, the instruction count drops to 4. That's unexpected, since overflow can still happen, assuming the compiler is already being so strict? – Yale Zhang May 10 '14 at 01:07
  • But even if the compiler didn't optimize i + 16 <= argc to i <= argc - 16, it should still be able to produce 4 instructions. But that will probably have to be another question, since my question as asked is really about why i + 16 <= argc is not the same as i <= argc - 16. – Yale Zhang May 10 '14 at 01:14
  • @YaleZhang: If the type is uint32, then i+16 <= argc will be true when expected, or when i >= 4294967280, or argc < 16. – supercat Jun 18 '14 at 01:47
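
A small demo of the unsigned case supercat describes, with values chosen (hypothetically) so that i + 16 wraps around on a platform with 32-bit int:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t argc = 10;         /* only 10 elements                */
    uint32_t i = 4294967290u;   /* 2^32 - 6, so i + 16 wraps to 10 */

    /* Unsigned wraparound is well-defined, so this prints 1 even though
       i is far past the end; the compiler must preserve that behavior,
       which is presumably why the uint32_t version keeps the extra
       instruction. */
    printf("%d\n", i + 16 <= argc);
    return 0;
}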