
Consider the loop below (https://godbolt.org/z/z4Wz1aanK), which has no loop-carried dependence. Will a modern CPU speculatively execute the next iteration while the previous one is still in flight? If so, is loop unrolling still necessary here?

void bar(void)
{
    for (int i = 0; i < 1024; i++)
       out[i] = foo(src[i]);
}

The result of compilation:

bar():
        pushq   %rbx
        xorl    %ebx, %ebx
.L2:
        movl    src(%rbx), %edi
        addq    $4, %rbx
        call    foo(int)
        movl    %eax, out-4(%rbx)
        cmpq    $4096, %rbx
        jne     .L2
        popq    %rbx
        ret
src:
        .zero   400
out:
        .zero   400

Update 1: I am now sure that speculative execution can cross loop iterations. The question is how far ahead it can go, given the dependency chain introduced by the loop counter i.

Changbin Du

1 Answer


Yes, this loop will likely benefit from branch prediction / speculative execution.

Loop unrolling by hand is generally considered an obsolete optimization; see, for example, https://www.intel.com/content/www/us/en/developer/articles/technical/avoid-manual-loop-unrolling.html

Speculative execution does not change the observable behaviour of your program. It does not even require compiler support, since it is something the CPU does on its own when it encounters conditional jumps. Whether your iterations are predicted correctly depends on what happens inside foo and possibly even on the data in src. If foo has too many conditionals, or if those conditionals follow hard-to-predict patterns, mispredictions become more frequent and speculation helps less.
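To make the point about foo concrete, here is a minimal sketch in C. The names foo_branchless, foo_branchy and apply are hypothetical and not from the question; they only contrast a callee with no conditionals against one whose single branch depends on the data in src.

```c
#include <stddef.h>

/* Hypothetical foo with no branches: nothing inside it can be mispredicted. */
static int foo_branchless(int x)
{
    return x * 3 + 1;
}

/* Hypothetical foo with a data-dependent branch: how well it predicts
 * depends on the pattern of odd and even values in src. */
static int foo_branchy(int x)
{
    if (x & 1)
        return x * 3 + 1;
    return x / 2;
}

/* Same loop shape as bar(), with the callee passed in as a parameter. */
static void apply(int (*f)(int), const int *src, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = f(src[i]);
}
```

With foo_branchy, a src that alternates odd and even values in a regular pattern is likely easy for the predictor, while random values are likely to be hard. The loop-closing branch itself is taken on every iteration except the last, so it is normally predicted well in both cases.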

Other optimizations may still appear in the generated code if the compiler thinks they are beneficial: there might be loop unrolling, there might be SIMD operations. To see what the compiler actually does with your code, you can try https://godbolt.org/
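For reference, this is roughly what unrolling by four looks like when written out in C. It is only an illustration of the transformation, not a recommendation to do it by hand. The body of foo is a stand-in, since the question leaves it unspecified, and the function names bar_simple and bar_unrolled are made up for this sketch.

```c
#include <stddef.h>

/* Stand-in for foo; the question does not say what it does. */
static int foo(int x)
{
    return x + 1;
}

/* Straightforward loop, same shape as bar() in the question. */
static void bar_simple(const int *src, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = foo(src[i]);
}

/* The same loop unrolled by four: one compare-and-branch per four calls. */
static void bar_unrolled(const int *src, int *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = foo(src[i]);
        out[i + 1] = foo(src[i + 1]);
        out[i + 2] = foo(src[i + 2]);
        out[i + 3] = foo(src[i + 3]);
    }
    for (; i < n; i++)          /* leftover iterations when n % 4 != 0 */
        out[i] = foo(src[i]);
}
```

Both versions compute the same result. Whether the unrolled form is any faster depends on the CPU, the compiler and what foo does, so it is worth measuring before keeping it.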

julaine
  • Thank you. Will the loop counter create a dependence chain, since each increment depends on the previous value? – Changbin Du Jul 24 '23 at 07:38
  • Not fully sure what you mean by "construct a dependence chain". If you are worried that your `i++`-style loop will force the machine to execute the iterations one after another, I recommend you play around with godbolt a bit. You will see that modern compilers with optimizations turned on can do a lot of things with your loops. 'Trivial' loops like yours are abstract control structures to the optimizer. There will be no `i` variable on the stack; it will live in a register, or the compiler may increment the pointer value directly, in which case `i` will not exist at all. – julaine Jul 24 '23 at 09:14
  • Yes, I mean the `i++`, `src[i]`, `out[i]`. GCC turns `i++` into 'add rbx, 4'. So I think the next iteration can only run in parallel once the previous value of `rbx` has been computed, even though `rbx` is overwritten, right? (I pasted the asm code into the question.) – Changbin Du Jul 24 '23 at 09:52
  • @ChangbinDu You cannot see speculative execution in the assembly output; it is a CPU feature, a detail of how the assembly gets executed. It is true that your asm does not include parallelism. It cannot contain parallelism because the compiler cannot reason about `foo`. – julaine Jul 24 '23 at 09:58
  • Please ignore compiler-level parallelism optimization here. All I am trying to discuss is the hardware level. (Sorry for my unclear description; just assume `foo` can be anything.) – Changbin Du Jul 24 '23 at 10:03
  • 1) Instruction-level parallelism (branch prediction and instruction pipelining) is simply not visible in assembly. There is no instruction to `trigger` it. The CPU just does it if possible. You might want to use valgrind to get statistics about how well branch prediction works for you (but that also depends on the data, not just the code). 2) The fact that `foo` can be anything is relevant to the compiler: it has to compile the code into a plain `call` to that function. If `foo` is known, your loop might be compiled differently. – julaine Jul 24 '23 at 10:20
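The remark in the comments that `i` may not exist at all can be sketched in C as follows. This is a hand-written approximation of the pointer-increment form a compiler might pick, not actual compiler output; the body of `foo` and the name `bar_ptr` are assumptions made for the example.

```c
#include <stddef.h>

/* Stand-in for foo; the question leaves its body unspecified. */
static int foo(int x)
{
    return x * 2;
}

/* Same work as bar(), but without an index variable: the only values
 * carried from one iteration to the next are the two pointers. */
static void bar_ptr(const int *src, int *out, size_t n)
{
    const int *end = src + n;   /* one-past-the-end sentinel */
    while (src != end)
        *out++ = foo(*src++);
}
```

In either form, the only loop-carried dependence is the counter or pointer increment, which is a single cheap add per iteration. The load, the call and the store of one iteration do not depend on the results of the previous iteration, only on that incremented value.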