Does this optimization even matter?

Question

This page recommends "loop unrolling" as an optimization:

Loop overhead can be reduced by reducing the number of iterations and replicating the body of the loop.

Example:

In the code fragment below, the body of the loop can be replicated once and the number of iterations can be reduced from 100 to 50.
for (i = 0; i < 100; i++)
  g ();
Below is the code fragment after loop unrolling.
for (i = 0; i < 100; i += 2)
{
  g ();
  g ();
}

With GCC 5.2, loop unrolling isn't enabled unless you use -funroll-loops (it's not enabled in either -O2 or -O3). I've inspected the assembly to see if there's a significant difference.

g++ -std=c++14 -O3 -funroll-loops -c -Wall -pedantic -pthread main.cpp && objdump -d main.o

Version 1:

   0:   ba 64 00 00 00          mov    $0x64,%edx
   5:   0f 1f 00                nopl   (%rax)
   8:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # e <main+0xe>
   e:   83 c0 01                add    $0x1,%eax
  # ... etc ...
  a1:   83 c1 01                add    $0x1,%ecx
  a4:   83 ea 0a                sub    $0xa,%edx
  a7:   89 0d 00 00 00 00       mov    %ecx,0x0(%rip)        # ad <main+0xad>
  ad:   0f 85 55 ff ff ff       jne    8 <main+0x8>
  b3:   31 c0                   xor    %eax,%eax
  b5:   c3                      retq

Version 2:

   0:   ba 32 00 00 00          mov    $0x32,%edx
   5:   0f 1f 00                nopl   (%rax)
   8:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # e <main+0xe>
   e:   83 c0 01                add    $0x1,%eax
  11:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # 17 <main+0x17>
  17:   8b 0d 00 00 00 00       mov    0x0(%rip),%ecx        # 1d <main+0x1d>
  1d:   83 c1 01                add    $0x1,%ecx
  # ... etc ...
 143:   83 c7 01                add    $0x1,%edi
 146:   83 ea 0a                sub    $0xa,%edx
 149:   89 3d 00 00 00 00       mov    %edi,0x0(%rip)        # 14f <main+0x14f>
 14f:   0f 85 b3 fe ff ff       jne    8 <main+0x8>
 155:   31 c0                   xor    %eax,%eax
 157:   c3                      retq

Version 2 produces more iterations. What am I missing?

Care to post the source for that code in order to get a better answer? — Marco A., Sep 04 '15 at 18:29
IMO, it is best to leave it to the compiler to unroll loops. even Intel says that it is best to keep loops unrolled. https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling — Max, Sep 04 '15 at 18:32
I think the example was to demonstrate the idea. It is quite costly to spend development time optimizing code like in this example. — StahlRat, Sep 04 '15 at 18:39
How does example 2 produce more *iterations*? The loop counter is incremented by 2, thus the loop is executed half as many times. The compiler probably has *duplicated the code in the loop* but not increased the number of iterations. **Big Difference between iterations and loop code content**. — Thomas Matthews, Sep 04 '15 at 18:47
The only way to find out is to profile your code. Loop unrolling could help because it reduces the number of jumps, or it could hurt because the larger loop body is no longer all in the cache. — Davislor, Sep 04 '15 at 20:24

score 4 · Answer 1 · answered Sep 04 '15 at 18:45

4

Yes, there are cases where loop unrolling will make the code more efficient.

The theory is reduce the less overhead (branching to top of loop and incrementing loop counter).

Most processors hate branch instructions. They love data processing instructions. For every iteration, there is a minimum of one branch instruction. By "duplicating" a set of code, the number of branches is reduced and the data processing instructions is increased between branches.

Many modern compilers have optimization settings to perform loop unrolling.

answered Sep 04 '15 at 18:45

Thomas Matthews

56,849
17
98
154

This isn't a complete picture for modern CPUs with branch predictors, speculative execution and instruction caching. If anything, the *payoff* for loop unrolling is on a steady decline. That doesn't contradict anything in this answer though. – Brett Hale Sep 04 '15 at 19:10
The fundamental theory is that data processing instructions are faster to process than branch instructions. As you say, branch instructions cause the branch prediction to kick in, which usually is slower than a data processing cycle; it does speed up a branch processing cycle though. – Thomas Matthews Sep 04 '15 at 19:13

score 1 · Answer 2 · answered Sep 04 '15 at 20:35

It doesn’t produce more iterations; you’ll notice that the loop that calls g() twice runs half as many times. (What if you have to call g() an odd number of times? Look up Duff’s Device.)

In your listings, you’ll notice that the assembly-language instruction jne 8 <main+0x8> appears once in both. This tells the processor to go back to the start of the loop. In the original loop, this instruction will run 99 times. In the rolled loop, it will only run 49 times. Imagine if the body of the loop is something very short, just one or two instructions. These jumps might be a third or even half of the instructions in the most performance-critical part of your program! (And there is even a useful loop with zero instructions: BogoMIPS. But the article about optimizing that was a joke.)

So, unrolling the loop trades speed for code size, right? Not so fast. Maybe you’ve made your unrolled loop so big that the code at the top of the loop is no longer in the cache, and the CPU needs to fetch it. In the real world, the only way to know if it helps is to profile your program.

Does this optimization even matter?

2 Answers2