This page recommends "loop unrolling" as an optimization:
Loop overhead can be reduced by reducing the number of iterations and replicating the body of the loop.
Example:
In the code fragment below, the body of the loop can be replicated once and the number of iterations can be reduced from 100 to 50.
for (i = 0; i < 100; i++) g ();
Below is the code fragment after loop unrolling.
for (i = 0; i < 100; i += 2) { g (); g (); }
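As a sanity check that the two fragments above are equivalent, here is a minimal sketch. It assumes a hypothetical `g()` that only counts its own calls; the page does not say what `g()` does. The helper names `run_rolled` and `run_unrolled_by_2` are made up for illustration.

```cpp
// Hypothetical stand-in for g(): it only counts how many times it is called.
static int calls = 0;
static void g() { ++calls; }

// Original loop: 100 iterations, one call to g() per iteration.
int run_rolled() {
    calls = 0;
    for (int i = 0; i < 100; i++) g();
    return calls;
}

// Manually unrolled by a factor of 2: 50 iterations, two calls per iteration.
int run_unrolled_by_2() {
    calls = 0;
    for (int i = 0; i < 100; i += 2) { g(); g(); }
    return calls;
}
```

Both functions should report the same number of calls. The only intended difference is how often the loop condition and increment are evaluated.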
With GCC 5.2, loop unrolling isn't enabled unless you use -funroll-loops (it's not enabled in either -O2 or -O3). I've inspected the assembly to see if there's a significant difference.
g++ -std=c++14 -O3 -funroll-loops -c -Wall -pedantic -pthread main.cpp && objdump -d main.o
Version 1:
0: ba 64 00 00 00 mov $0x64,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
# ... etc ...
a1: 83 c1 01 add $0x1,%ecx
a4: 83 ea 0a sub $0xa,%edx
a7: 89 0d 00 00 00 00 mov %ecx,0x0(%rip) # ad <main+0xad>
ad: 0f 85 55 ff ff ff jne 8 <main+0x8>
b3: 31 c0 xor %eax,%eax
b5: c3 retq
Version 2:
0: ba 32 00 00 00 mov $0x32,%edx
5: 0f 1f 00 nopl (%rax)
8: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # e <main+0xe>
e: 83 c0 01 add $0x1,%eax
11: 89 05 00 00 00 00 mov %eax,0x0(%rip) # 17 <main+0x17>
17: 8b 0d 00 00 00 00 mov 0x0(%rip),%ecx # 1d <main+0x1d>
1d: 83 c1 01 add $0x1,%ecx
# ... etc ...
143: 83 c7 01 add $0x1,%edi
146: 83 ea 0a sub $0xa,%edx
149: 89 3d 00 00 00 00 mov %edi,0x0(%rip) # 14f <main+0x14f>
14f: 0f 85 b3 fe ff ff jne 8 <main+0x8>
155: 31 c0 xor %eax,%eax
157: c3 retq
Version 2 produces more iterations. What am I missing?
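One way to check the iteration counts is to read them off the counter setup visible in the listings. This is a rough sketch, not a definitive analysis. It assumes `%edx` is the only loop counter and that the elided instructions do not modify it. The helper names `trip_count` and `calls_per_trip` are made up for illustration.

```cpp
// Trip count of a count-down loop of the form
//   counter = start; do { ...; counter -= step; } while (counter != 0);
// Assumes start is an exact multiple of step.
constexpr int trip_count(int start, int step) { return start / step; }

// Number of g() bodies each trip must contain if the loop performs
// total_calls calls overall.
constexpr int calls_per_trip(int total_calls, int trips) { return total_calls / trips; }

// Version 1: mov $0x64,%edx ... sub $0xa,%edx ... jne
constexpr int v1_trips = trip_count(0x64, 0xa);       // 100 / 10 = 10 trips
constexpr int v1_calls = calls_per_trip(100, v1_trips); // 10 g() bodies per trip

// Version 2: mov $0x32,%edx ... sub $0xa,%edx ... jne
constexpr int v2_trips = trip_count(0x32, 0xa);       // 50 / 10 = 5 trips
constexpr int v2_calls = calls_per_trip(100, v2_trips); // 20 g() bodies per trip
```

Under those assumptions, the compiled loop in Version 2 takes fewer trips through the back edge than Version 1, but each trip contains more code. This matches the longer listing (the loop ends at offset 0x155 rather than 0xb3).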