
A while back, I was reading up on some Android performance tips when I came across this example:

Foo[] mArray = ...

public void zero() {
    int sum = 0;
    for (int i = 0; i < mArray.length; ++i) {
        sum += mArray[i].mSplat;
    }
}

public void one() {
    int sum = 0;
    Foo[] localArray = mArray;
    int len = localArray.length;

    for (int i = 0; i < len; ++i) {
        sum += localArray[i].mSplat;
    }
}

Google says:

zero() is slowest, because the JIT can't yet optimize away the cost of getting the array length once for every iteration through the loop.

one() is faster. It pulls everything out into local variables, avoiding the lookups. Only the array length offers a performance benefit.

Which made total sense. But after thinking way too much about my computer architecture exam, I remembered branch predictors:

a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline.

Isn't the computer assuming `i < mArray.length` is true and thus computing the loop condition and the body of the loop in parallel (only mispredicting the branch on the final iteration), effectively removing any performance losses?

I was also thinking about Speculative Execution:

Speculative execution is an optimization technique where a computer system performs some task that may not be actually needed... The objective is to provide more concurrency...

In this case, wouldn't the computer be executing code both as if the loop had finished and as if it were still going, concurrently, once again effectively nullifying any computational cost associated with the condition (since the computer is already performing future computations while the condition is being evaluated)?

Essentially, what I'm trying to get at is this: even if the condition in zero() takes a little longer to compute than in one(), the computer is usually going to be computing the correct branch of code while it waits for the answer to the conditional anyway, so the performance loss from the lookup of mArray.length shouldn't matter (that's what I thought, anyway).

Is there something I'm not realizing here?


Sorry about the length of the question.

Thanks in advance.

Patrick
  • It's not about branch prediction. It's about having to re-read the value of `mArray.length` every time the loop condition is checked. (Branch prediction would still predict it's some value that will result in "true") Also notice the "*yet*" and that this article is quite old. – zapl Jun 08 '16 at 17:03
  • But if you're re-reading (which is part of the condition check) and computing the body of the loop (since we're probably predicting that that's what will run once the condition is checked) at the same time, I would have thought the re-read wouldn't contribute to the problem: it happens concurrently with the body of the loop, and since the body takes longer than the re-read plus the check itself, it wouldn't matter whether there is a re-read or not. – Patrick Jun 08 '16 at 17:12
  • 1
    There is no concurrency involved, loop-body instructions, then loop condition-check instructions, then a conditional jump either back to the beginning of the loop body or out. Rinse and repeat. The only "problem" is an additional fetch from memory while loading the array length. – zapl Jun 08 '16 at 17:37

1 Answer


The site you linked to notes:

zero() is slowest, because the JIT can't yet optimize away the cost of getting the array length once for every iteration through the loop.

I haven't tested on Android, but I'll assume that this is true for now. What it means is that on every iteration of the loop, the CPU has to execute code that loads the value of mArray.length from memory. The reason is that mArray is a field that may be reassigned to refer to a different array (an array's length itself is fixed in Java, but which array the field points to is not), so the compiler can't treat mArray.length as a constant.
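
To make that concrete, here is a minimal sketch of the hazard (my own illustration, not code from the question; the resize() method is hypothetical):

static class Foo { int mSplat; }

Foo[] mArray = new Foo[16];

public int zero() {
    int sum = 0;
    // Each iteration re-evaluates mArray.length. A simple JIT can't
    // blindly hoist it out of the loop, because the mArray field may be
    // reassigned (e.g. by resize() below, possibly from another thread),
    // which changes what mArray.length evaluates to.
    for (int i = 0; i < mArray.length; ++i) {
        sum += mArray[i].mSplat;
    }
    return sum;
}

public void resize(int newLen) {
    // The old array keeps its length; the field now points to a new
    // array, so subsequent reads of mArray.length see a different value.
    mArray = new Foo[newLen];
}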

In the one() version, by contrast, the programmer explicitly hoists the length into the local variable len, based on the knowledge that the array won't be replaced during the loop. Since len is a local variable, the compiler can keep it in a register rather than loading it from memory on every iteration. This reduces the number of instructions executed in the loop and lets the loop branch resolve sooner.
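
For what it's worth, the enhanced for loop gets you the same effect without a manual temporary; this variant is my own sketch, not from the linked page:

public int two() {
    int sum = 0;
    // The enhanced for loop copies the mArray field into a hidden local
    // once, so the loop never re-reads the field; the length of that
    // fixed local array is then trivially loop-invariant.
    for (Foo foo : mArray) {
        sum += foo.mSplat;
    }
    return sum;
}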

You are right that branch prediction helps reduce the overhead associated with the loop condition check. But there is still a limit to how far ahead the CPU can speculate, so executing more instructions in each loop iteration can still incur additional overhead. Also, many mobile processors have less advanced branch predictors and don't support as much speculation.

My guess is that on a modern desktop processor with an advanced Java JIT like HotSpot, you would not see a 3x performance difference. But I don't know for certain; it could be an interesting experiment to try.
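
If anyone wants to try it on a desktop JVM, a JMH harness along these lines would do (a sketch; the class name, array size, and Foo stand-in are mine):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class LoopBench {
    static class Foo { int mSplat; }

    Foo[] mArray;

    @Setup
    public void setup() {
        mArray = new Foo[10_000];
        for (int i = 0; i < mArray.length; ++i) {
            mArray[i] = new Foo();
        }
    }

    @Benchmark
    public int zero() {
        int sum = 0;
        for (int i = 0; i < mArray.length; ++i) {
            sum += mArray[i].mSplat;
        }
        return sum; // returned so the JIT can't dead-code the loop
    }

    @Benchmark
    public int one() {
        int sum = 0;
        Foo[] localArray = mArray;
        int len = localArray.length;
        for (int i = 0; i < len; ++i) {
            sum += localArray[i].mSplat;
        }
        return sum;
    }
}

Note that both methods return sum: the original snippets discard it, and a good JIT would eliminate the whole loop in that case.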

Gabriel Southern
  • Speculative execution actually solves a completely different problem from compile-time hoisting of invariants. Without it, the pipeline couldn't start executing the next iteration until the branch at the end of the loop retired. Having no speculative execution might actually slow both loops down to the same speed, because having more work inside one of the loops might matter less. (Of course, an in-order CPU wouldn't start executing the loop branch as quickly with more work in the loop body, and out-of-order execution without speculation is just not plausible.) – Peter Cordes Jun 08 '16 at 20:27
  • Or did you mean that superscalar / out-of-order execution could allow the fetch of `mArray.length` to happen in parallel with the work inside the loop? – Peter Cordes Jun 08 '16 at 20:28
  • 1
    Yes I meant that the fetch of the mArray.length could be done in parallel with other work in the loop. For instance if there was an operation in the loop that caused a cache miss, then that would likely be the bottleneck and the fetch of `mArray.length` (which would usually hit) wouldn't matter as much relative to the time spent servicing the miss. – Gabriel Southern Jun 08 '16 at 21:25
  • @PeterCordes Also, regarding speculation: I think most mobile CPUs today support out-of-order execution, but the structures are typically smaller than in desktop CPUs, so less speculation is possible. I agree with your points about an in-order CPU, but that wasn't really what I was thinking about. – Gabriel Southern Jun 08 '16 at 21:28
  • I *think* the small cores in big.LITTLE designs are often in-order. Even some recent 64-bit AArch64 designs are in-order, e.g. [wikipedia says Cortex-A53 is dual-issue in-order](https://en.wikipedia.org/wiki/List_of_ARM_microarchitectures). However, even in-order designs can hide memory latency with [simpler techniques like scoreboarding](http://stackoverflow.com/questions/36989954/multiple-accesses-to-main-memory-and-out-of-order-execution#comment61543304_36990492). (I don't know exactly what that means; I haven't looked into it. :P) So same result for hiding this load. – Peter Cordes Jun 08 '16 at 23:52
  • 1
    you are right that in-order is still used in some configurations particularly for little cores. Scoreboarding lets the CPU issue to multiple functional units simultaneously as long as there aren't any hazards. So it can provide higher utilization of CPU resources, without the overhead of structures needed for speculative execution. – Gabriel Southern Jun 09 '16 at 15:25