
I assume everyone knows what "unrolling loops" means, but just in case, I'll give a concrete example in a moment. The question I will ask is... does unrolling loops in x86-64 assembly language actually make code faster? I will explain why I have begun to question this notion.

For those not familiar with the term "unrolling loops", here is one example of a loop from code I am writing now:

    movq   $15, %rcx                  # rcx = loop iterations
s1024_divide_compare_loop:
    movq   (%r14, %rcx, 8), %rax      # rax = numerator
    subq   (%r15, %rcx, 8), %rax      # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)
    subq   $1, %rcx                   # rcx = rcx - 1 : one less loop iteration
    jns    s1024_divide_compare_loop  # check next lower 64-bit portion of n & d

And here is what that loop looks like unrolled:

    movq   120(%r14), %rax            # rax = numerator
    subq   120(%r15), %rax            # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)

    movq   112(%r14), %rax            # rax = numerator
    subq   112(%r15), %rax            # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)

    movq   104(%r14), %rax            # rax = numerator
    subq   104(%r15), %rax            # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)

    movq   96(%r14), %rax             # rax = numerator
    subq   96(%r15), %rax             # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)
 #
 # insert 11 more copies of the above 4 lines (with different offsets) here
 #
    movq   0(%r14), %rax              # rax = numerator
    subq   0(%r15), %rax              # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)

I should note that the above example is representative, but may well not be ideal for this discussion, because of the large number of conditional branches. While I believe "branches not taken" cost about the same as other instructions, that assumption might not hold in some (possibly obscure) cases. So if this aspect is relevant, assume for this discussion that those two conditional branches are just simple instructions like movq or addq (though if the two cases differ, addressing them separately is desirable).
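
To make that concrete, here is the kind of stand-in I mean. It is only a sketch for reasoning about instruction throughput (the label and scratch registers are mine, and it deliberately no longer computes the early exits):

    xorq   %r8, %r8                   # scratch accumulator for the stand-in ops
    xorq   %r9, %r9                   # scratch accumulator for the stand-in ops
    movq   $15, %rcx                  # rcx = loop iterations
s1024_compare_strawman_loop:
    movq   (%r14, %rcx, 8), %rax      # rax = numerator word
    subq   (%r15, %rcx, 8), %rax      # rax = numerator - denominator
    orq    %rax, %r8                  # plain ALU op standing in for 'js'
    addq   %rax, %r9                  # plain ALU op standing in for 'jnz'
    subq   $1, %rcx                   # one less loop iteration
    jns    s1024_compare_strawman_loop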

Oh, two final conditions:

#1: This question ONLY applies to code running on a single thread.

#2: This question ONLY applies to fast modern ~4GHz CPUs (FX-8350, etc.).

Okay, now for thoughts that make me question whether unrolling loops is actually wise.

These new processors run at 4 GHz and can sometimes execute two or three instructions in each cycle. In my code above, the first two instructions cannot execute in parallel, and probably the last three cannot either. But a loop with instructions that can execute in parallel only makes my questions more relevant.
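
For example, that first dependent pair could be reshuffled into two independent loads followed by a register-register subtract, which is the kind of change that gives the CPU something to do in parallel. Just a sketch, using %rdx as a scratch register:

    # original: the subq must wait for the load that feeds %rax
    movq   (%r14, %rcx, 8), %rax      # rax = numerator word
    subq   (%r15, %rcx, 8), %rax      # depends on the load above

    # reshuffled: the two loads are independent and can issue in the same cycle
    movq   (%r14, %rcx, 8), %rax      # load numerator word
    movq   (%r15, %rcx, 8), %rdx      # load denominator word (no dependency)
    subq   %rdx, %rax                 # sets the flags the same way as before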

So, each instruction may execute in 0.25 nanoseconds (or less for instructions that execute in parallel). This means 4 instructions take 1 nanosecond to execute. Each set of 4 instructions consumes roughly 16 bytes, which I assume is 1/4 of a cache line. Therefore, 4 sets of 4 instructions should take about 4 ns to execute, at which point the CPU has run off the end of that cache line and needs another.

This is where the question becomes more complicated.

So, after 16 instructions (one cache line, and 1/4 of my entire unrolled loop), the CPU needs more instructions to execute. If the CPU were running the looped version of the code, it would execute the exact same instructions again, which would definitely still be in the L1 cache, and it would therefore continue to execute at full CPU speed.

However, can we reasonably expect the CPU to have loaded the next cache line in only 4ns? Or in the case of instructions that could execute in parallel, can we reasonably expect the CPU to have loaded the next cache line in only 2ns?

From what I know about dynamic RAM, that seems... tight. I know that when the CPU accesses contiguous addresses, it can leave RAS (the upper address bits) latched and clock out consecutive 64-bit or 128-bit chunks of memory faster with CAS. But can we really expect the CPU to read 64 bytes in 2 ns or 4 ns? That is 4 or 8 read-from-DRAM cycles, depending on whether the CPU reads 64 bits (8 bytes) or 128 bits (16 bytes) per read operation.
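
To put rough numbers next to that (ballpark figures, not something I have measured on this machine):

    useful work in one 64-byte line of code:  16 instructions  =  about 2-4 ns
    a fetch that misses all the way to DRAM:  roughly 50-100 ns =  200-400 cycles at 4 GHz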

My specific code may stress this question even further. By the nature of the algorithm, my code needs to compare the most-significant portions of the numerator and denominator first, then work downward toward lower addresses with each access. Does this make automatic prefetching less likely to work?
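
If descending accesses do defeat the automatic prefetcher, one thing I could try is an explicit hint for the next lower line of each operand. An untested sketch of the same loop with two hint instructions added (prefetcht0 never faults, so hinting below the start of the buffer on the last iterations is harmless):

s1024_divide_compare_loop:
    prefetcht0  -64(%r14, %rcx, 8)    # hint: fetch the next LOWER line of the numerator
    prefetcht0  -64(%r15, %rcx, 8)    # hint: fetch the next LOWER line of the denominator
    movq   (%r14, %rcx, 8), %rax      # rax = numerator
    subq   (%r15, %rcx, 8), %rax      # flags = numerator - denominator
    js     s1024_divide_done          # divide done: (numerator < denominator)
    jnz    s1024_upshift_done         # do the divide: (numerator > denominator)
    subq   $1, %rcx                   # one less loop iteration
    jns    s1024_divide_compare_loop  # check next lower 64-bit portion of n & d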

I have seen various people test loops versus unrolled loops. But every instance I have seen has a fatal design flaw: it calls the same routine over and over and over again... usually millions of times... in order to accumulate a timer value large enough to make sense of. But wait! Like most applications, mine will likely call my 1024-bit divide function only occasionally. That is nothing like these tests, which by their very nature ensure that the instructions stay in the L1 instruction cache and the accessed data stays in the L1 data cache.

Of course unrolled loops are faster if you make sure the code and data are already in L1 cache! Duh!

Those are not representative tests - not even close!

Those tests do measure "best-case performance", but they fail to represent normal program execution at all. To decide how best to write 64-bit assembly-language code (or the code-emitter portion of compilers), we need to be more realistic in our premises. Or at least I think so, which is why I ask this question.
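
For what it is worth, here is the direction I am considering for a more realistic measurement: evict the routine's code and its operands from every cache level before each timed call, so the timing includes the cold-cache cost rather than the L1-resident best case. This is only an untested sketch; s1024_divide is a hypothetical entry label, %r14 is assumed to still hold the numerator pointer, and the clflush lines would need to be repeated to cover the whole routine and both operands:

    leaq   s1024_divide(%rip), %rax   # address of the routine under test
    clflush (%rax)                    # evict its first code cache line
    clflush 64(%rax)                  # ... and the next (repeat to cover the routine)
    clflush (%r14)                    # evict the numerator's first cache line
    clflush 64(%r14)                  # ... and so on for both operands
    mfence                            # make sure the flushes complete before timing
    rdtsc                             # timestamp in edx:eax just before the call
                                      # (serializing around rdtsc is its own topic)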

Has anyone been through these questions in a thorough and realistic way?

honestann
  • It depends on the loops and how they're unrolled. I know that removing Duff's devices from the XFree86 code made Xorg faster. – Elliott Frisch Jan 03 '14 at 04:58
  • Unrolling factors are generally less than 16, especially where the number of iterations can't be precomputed. – Ben Voigt Jan 03 '14 at 06:16
  • Though this isn't a general case, the specific case I show will almost always branch out on the 1st or 2nd of 16 loop iterations. How does this impact the decision? By "unrolling factor", you mean no more than 16 elements should be unrolled into a sequence, and any further should be executed in the loop? – honestann Jan 03 '14 at 06:58
  • Have you looked into http://www.agner.org/optimize/microarchitecture.pdf? And other books at http://www.agner.org/optimize/? – dbrank0 Jan 03 '14 at 10:25
  • Why do you think calling the same code many times (and therefore having it cached) is not realistic? Most real-world workloads should behave that way, at least in their performance critical phases. – Leeor Jan 03 '14 at 11:08
  • This was a concern 10+ years ago. Modern x86 cores have so much L1 instruction cache that it just doesn't matter anymore. – Hans Passant Jan 03 '14 at 11:12
  • @HansPassant: Do you consider 16KB data cache per CPU and 32KB code cache per CPU to be "large"? That's what `FX-8350` CPUs have. That's enough for one typical function, but not much more. – honestann Jan 03 '14 at 12:39
  • @Leeor: Well, perhaps the most central, performance critical parts of an application will be called in a tight loop, but the majority of support functions are called at random times. For example, right now I'm writing a s1024 divide function, which will be called... well... whenever someone needs to perform a 1024-bit divide. Which is unlikely to keep it in L1 cache, and quite possibly not L2 or L3 cache either. So I'd say most functions are only called now and then, but yes, the most key, core, performance critical routines likely stay in cache since they tend to be in tight loops. – honestann Jan 03 '14 at 12:44

3 Answers


Intel provides an optimization manual for people who are interested in tuning code for their processors, and it contains some treatment of loop unrolling. I wouldn't expect a nice simple answer, but it may help.

By the way, the `loop` instruction should be avoided. It's been slower than the equivalent instructions for many years now.
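
For example, where a counted loop might be written with `loop` (a sketch, not taken from your code):

    # compact but slow on most cores: 'loop' decrements %rcx and jumps while nonzero
    movq   $16, %rcx
1:
    # ... loop body ...
    loop   1b

    # the usual replacement does the same thing faster, but note it clobbers the flags
    movq   $16, %rcx
2:
    # ... loop body ...
    subq   $1, %rcx
    jnz    2b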

gsg
  • Thanks for the tips. I read the Intel manual you mention. I also looked in the Agner Fog manuals for the instruction timing of `loop` and found the timing exactly the same as other conditional branch instructions. Since it also performs the subtract on the loop counter, this would seem to make `loop` faster than any alternative, which is why I did not understand your comment, or the similar comment in the Agner Fog document on optimizing assembly language! But `loop` is horribly slow on every CPU short of Bulldozer and Piledriver. Yikes - also slow on all Intel CPUs!!! – honestann Jan 03 '14 at 06:50

It's complicated.

To answer your final question: yes and no. There is a lot of research on optimizations, with thorough analysis. But for some of the reasons you mention, it is almost impossible to show that one optimization actually has an effect on real-world performance. Every change made to one optimization affects other optimizations, and the workload can have differing effects.

But you could test it out yourself pretty easily (well... I imagine you will quickly run into problems and conflicting results). GCC lets you turn specific optimizations on and off, so find the source of a program you want to test and give it a go.
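
The relevant switches are -funroll-loops and -fno-unroll-loops, so the comparison can be as simple as building the same program twice and timing both, e.g.:

    gcc -O2 -funroll-loops    -o prog_unrolled  prog.c
    gcc -O2 -fno-unroll-loops -o prog_rolled    prog.c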

My guess? It is going to come down to caches. If (on average) cache performance is improved, then loop unrolling is worth it.

superdesk
  • Looks like I'll be implementing three different methods to find out for myself. Boring, but appears necessary. – honestann Jan 04 '14 at 01:49
  • Except for super-central, super-critical loops that are inherently worth optimizing to the last cycle... how a routine/function impacts the overall throughput of a software application is all that matters. And clearly people tend to ignore the fact that filling up the L1 and L2 caches with unrolled loops evicts other routines that might be executed very soon. For example, it might evict the section of code executed just after the routine/function returns to the caller. So the idea of testing functions in the context of "typical use" in "a variety of applications" is clearly wise. – honestann Jan 04 '14 at 23:49

This level of optimization is highly dependent on the micro-architecture, and the subject is too broad for a comprehensive answer. The GMP library's mpn/x86_64 directory has READMEs and annotated assembly for different u-archs.

So yes - the contributors to GMP have dealt with these issues thoroughly. And unrolling does provide a speedup in some cases, though it's not as effective on modern x86-64. Loops that fit in the decoded instruction / uop cache, loop alignment, branch prediction, avoiding partial stalls, etc., are important as well. Agner Fog's optimization manuals are another excellent resource.
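
To give one concrete example from that list, loop alignment is a single directive in GNU as. A sketch against the loop in your question:

    .p2align 4                        # pad with nops so the loop top sits on a 16-byte boundary
s1024_divide_compare_loop:
    movq   (%r14, %rcx, 8), %rax      # rax = numerator
    # ... rest of the loop body unchanged ...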

Finally, using bitwise shift / subtract [non]restoring division written in assembly will never be competitive with a per-word multiple precision division implementation written in C. There's a reason GMP doesn't use it. The classical 'Knuth algorithm D' takes some effort (pen and paper) to understand - particularly the quotient estimate / quotient adjust conditions. Otherwise, I fear your efforts here might be wasted.
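
To give a flavour of the part that takes the pen-and-paper effort, the step-D3 estimate of one quotient word looks roughly like this. This is only a sketch with register assignments I made up for illustration (not tested, and the adjust step is only indicated in a comment):

    # assume: %r8  = u[j+n]    (top word of the working remainder)
    #         %r9  = u[j+n-1]  (next word of the working remainder)
    #         %r10 = v[n-1]    (top word of the normalized divisor)
    cmpq   %r10, %r8
    jae    qhat_saturate              # u[j+n] >= v[n-1]: clamp qhat to 2^64-1
    movq   %r8, %rdx                  # rdx:rax = u[j+n] : u[j+n-1]
    movq   %r9, %rax
    divq   %r10                       # rax = qhat estimate, rdx = rhat
    # D3 adjust: while qhat*v[n-2] > rhat*2^64 + u[j+n-2],
    #            decrement qhat and add v[n-1] to rhat (stop once rhat overflows)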

With fixed operand sizes you can store normalized divisors and working remainders on the stack. The algorithm is actually dominated by the cost of multiply instructions, as division instructions are only used in the estimate step. The Handbook of Applied Cryptography, Chapter 14, is a good reference for implementation details.

Brett Hale
  • I will check out "Knuth algorithm D" to see what it offers. Certainly the general idea is attractive, which is why I implemented my multiply functions based upon that idea (by accumulating intermediate results of x64 `128-bit = 64-bit * 64-bit` instructions). However, I still wonder whether it can beat Newton-Raphson. Probably I'll implement all three techniques and compare them. Wrote the first last night (though I still need to test for valid answers). Now I'll write the Newton-Raphson, and try to find a description of "Knuth algorithm D". Hopefully on the internet? Maybe? – honestann Jan 04 '14 at 01:46
  • Those Agner Fog documents are fantastic - found them ~2 years ago. – honestann Jan 04 '14 at 01:51
  • @honestann - I'm sure you'll find other [references](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.5815&rep=rep1&type=pdf). It really is worthwhile, as the quotient adjust 'concepts' extend to other algorithms - Barrett reduction, 2-by-1 or 3-by-2 word division cases, etc. I'm not sure about the break-even for Newton-Raphson - it might be a bit high for fixed 1024-bit lengths. – Brett Hale Jan 04 '14 at 07:01