This is interesting.
First guess
My first guess would be that gcc's loop unroll analysis expects the addition case to benefit less from loop unrolling because s
grows more slowly.
I experiment on the following code:
#include <stdio.h>
int main(int argc, char **args) {
int s;
int w = 255;
for (s = 1; s < 32; s = s * 2)
{
w = w ^ (w >> s);
}
printf("%d", w); // To prevent everything from being optimized away
return 0;
}
And another version that is the same except the loop has s = s + s
. I find that gcc 4.9.2 unrolls the loop in the multiplicative version but not the additive one. This is compiling with
gcc -S -O3 test.c
So my first guess is that gcc assumes the additive version, if unrolled, would result in more bytes of code that fit in the icache and therefore does not optimize. However, changing the loop condition from s < 32
to s < 4
in the additive version still doesn't result in an optimization, even though it seems gcc should easily recognize that there are very few iterations of the loop.
My next attempt (going back to s < 32
as the condition) is to explicitly tell gcc to unroll loops up to 100 times:
gcc -S -O3 -fverbose-asm --param max-unroll-times=100 test.c
This still produces a loop in the assembly. Trying to allow more instructions in unrolled loops with --param max-unrolled-insns retains the loop as well. Therefore, we can pretty much eliminate the possibility that gcc thinks it's inefficient to unroll.
Interestingly, trying to compile with clang at -O3
immediately unrolls the loop. clang is known to unroll more aggressively, but this doesn't seem like a satisfying answer.
I can get gcc to unroll the additive loop by making it add a constant and not s
itself, that is, I do s = s + 2
. Then the loop unrolls.
Second guess
That leads to me theorize that gcc is unable to understand how many iterations the loop will run for (necessary for unrolling) if the loop's increase value depends on the counter's value more than once. I change the loop as follows:
for (s = 2; s < 32; s = s*s)
And it does not unroll with gcc, while clang unrolls it. So my best guess, in the end, is that gcc fails to calculate the number of iterations when the loop's increment statement is of the form s = s (op) s
.