I am trying to see how loop unrolling is done in GCC. To test this, I wrote a small C program that adds two arrays element by element.

for (i=0;i<16384;i++)
  c[i] = a[i]+b[i];

I compiled it with the -o2 and -funroll-all-loops flags.

gcc -o2 -funroll-all-loops --save-temps pleaseUnrollTheLoops.c

The assembly generated for the above program is as follows.

main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $196504, %rsp
    movl    $0, -196612(%rbp)
    jmp .L2
.L3:
    movl    -196612(%rbp), %eax
    cltq
    movl    -196608(%rbp,%rax,4), %edx
    movl    -196612(%rbp), %eax
    cltq
    movl    -131072(%rbp,%rax,4), %eax
    addl    %eax, %edx
    movl    -196612(%rbp), %eax
    cltq
    movl    %edx, -65536(%rbp,%rax,4)
    addl    $1, -196612(%rbp)
.L2:
    cmpl    $16383, -196612(%rbp)
    jle .L3
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

Each iteration performs only one addition (the 7th line of the .L3 section) and increments the loop counter stored at -196612(%rbp) by 1 (the last line of the .L3 section). This indicates that the compiler is not unrolling the loop; I was expecting several additions per iteration. Why is the loop not unrolled even with the -funroll-all-loops flag? Is it possible that the compiler skips the optimization because it considers unrolling not useful in this case? If so, what should I do to make the compiler unroll the loop?
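To make the expectation concrete, here is a hand-written sketch of the loop shape I had in mind. The function name `add_unrolled4` and the unroll factor of 4 are illustrative assumptions only; GCC is not guaranteed to pick this factor or this exact structure.

```c
#include <stddef.h>

/* Hypothetical hand-unrolled version of the loop: four additions per
   iteration. The factor 4 is an arbitrary choice for illustration. */
void add_unrolled4(const int *a, const int *b, int *c, size_t n)
{
    size_t i = 0;

    /* Main body: handles elements in groups of four. */
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements, if any. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With n = 16384 the remainder loop never runs, so the loop bookkeeping (compare, branch, increment) would execute 4096 times instead of 16384.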

5gon12eder
Hegde
  • We need at least the whole block where `a`, `b`, `c` and `i` are declared. Maybe the compiler cannot see that they do not alias (cf. `restrict`), which may prevent optimizations. – mafso Sep 20 '14 at 13:28
  • BTW, using `gcc -S -fverbose-asm -O2` is helpful when you look inside the generated assembler code. – Basile Starynkevitch Sep 20 '14 at 13:35
  • For extra points, output the asm in Intel syntax. – harold Sep 20 '14 at 13:37
  • If you reduce the count to something smaller, say, 16, would the compiler unroll the loop? What about 32? 64?.. I would suspect that there's a limit after which there's a cutoff, but I do not have access to gcc to check this suspicion. – Sergey Kalinichenko Sep 20 '14 at 13:38
  • @dasblinkenlight It need not fully unroll the loop. (Which, for a loop of this size, would indeed be truly idiotic.) Grouping 8 (or so) repeated additions in each iteration would help to greatly reduce the bookkeeping overhead and makes sense for any size of the loop. However, I found that GCC is not very eager at doing this and using Duff's Device for manual unrolling can sometimes still improve performance even with all optimizations enabled. – 5gon12eder Sep 20 '14 at 13:44
  • If you are using a CPU with AVX or SSE2 (which is rather likely) you can generate significantly faster code if you use the correct options, and also indicate that the variables will not overlap using restrict. The latter is really most important. http://goo.gl/KkQm6H shows the loop with avx2 and restrict. – perh Sep 20 '14 at 14:32
  • @mafso The block of code that I am using is `int a[16384], b[16384], c[16384]; int i; for (i = 0; i < 16384; i++) { c[i] = a[i] + b[i]; }` – Hegde Sep 21 '14 at 04:33
  • You, um, seem to have misspelled `-O2` above. (Is that the only thing you've misspelled?) – SamB Nov 21 '15 at 20:04
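
The aliasing point raised in the comments can be sketched as follows. This assumes the loop is moved into a separate function taking pointers (the question's version uses local arrays, where the compiler can already see the arrays are distinct); the function name `add_restrict` is hypothetical. Whether GCC then unrolls or vectorizes the loop still depends on the compiler version and on the optimization flags actually passed.

```c
#include <stddef.h>

/* Hypothetical pointer-based variant. The C99 `restrict` qualifier is a
   promise by the caller that a, b and c do not overlap, so the compiler
   may reorder or group the loads and stores freely. */
void add_restrict(const int *restrict a, const int *restrict b,
                  int *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Calling this with overlapping arrays would break the promise made by `restrict` and is undefined behavior.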

0 Answers