I'm trying to use loop unrolling to optimize my code.

This was the original code:

int a[N]; //arbitrary array
int vara; //arbitrary variable
int varb; //arbitrary variable
for (int i=0;i<N;i++)
     a[i]=(a[i+1]* vara) + varb;

So I tried doing this:

for (int i=0;i<N-1;i+=2)
{
    int t0=a[i+1]*vara; //temporaries renamed so they do not shadow the array a
    int t1=a[i+2]*vara;
    int c=t0+varb;
    int d=t1+varb;
    a[i]=c;
    a[i+1]=d;
}

I thought this would work because it lets the compiler schedule the additions and multiplications from several iterations together, which I expected to increase instruction-level parallelism. Yet doing this does not speed up my code at all. What am I doing wrong?

Any other suggestions to optimize this code would also be much appreciated.

  • Which architecture are you compiling for? – Govind Parmar Jun 08 '18 at 04:11
  • Your compiler might spot that you've got undefined behaviour and do whatever it likes. When `i == N - 1`, your code accesses `a[N]`, which is out of bounds of your array and hence undefined behaviour. Don't try optimizing buggy code; make sure it is bug-free first. – Jonathan Leffler Jun 08 '18 at 04:41

1 Answer

Your compiler very likely unrolls loops already at high optimization levels; with GCC or Clang you may need -funroll-loops or something like it. But even the docs warn that this isn't a magic option for gaining speed, since it costs instruction cache and program space.

Loop unrolling is basically what you've done: run fewer loop iterations, each doing the work of several of the original ones. Whether or not it's faster depends heavily on the loop body and on the actual machine the code runs on.
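For reference, here is a minimal sketch of a two-way unrolled version that stays in bounds. The function names are hypothetical, and it assumes the intended loop stops at the second-to-last element so that a[i + 1] is always a valid read. The cleanup loop picks up the leftover element when the trip count is odd.

```c
#include <stddef.h>

/* Reference loop: each element is computed from its right-hand neighbour.
   Stops one short of the end so a[i + 1] never reads past the array. */
void shift_scale_add(int *a, size_t n, int vara, int varb)
{
    for (size_t i = 0; i + 1 < n; i++)
        a[i] = a[i + 1] * vara + varb;
}

/* Same computation, unrolled by two (hypothetical name). */
void shift_scale_add_unrolled2(int *a, size_t n, int vara, int varb)
{
    size_t i = 0;

    /* Main loop: two elements per iteration while a[i + 2] is in bounds. */
    for (; i + 2 < n; i += 2) {
        int t0 = a[i + 1] * vara + varb;
        int t1 = a[i + 2] * vara + varb;
        a[i] = t0;
        a[i + 1] = t1;
    }

    /* Cleanup loop: at most one leftover element. */
    for (; i + 1 < n; i++)
        a[i] = a[i + 1] * vara + varb;
}
```

Both functions should leave the array in the same state, which is worth checking before timing anything.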

Unrolling also really only makes sense if jumps are expensive and there's an instruction-level parallelism gain, which is unlikely here given the anti-dependency (each iteration reads a[i+1], which the next iteration then overwrites) and the well-tuned branch predictors in modern processors.

That said, you need to at the very least run some microbenchmarking with statistical analysis.
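As a rough starting point, something like the following could collect several timings and report the median instead of trusting a single run. This is only a sketch: `kernel_fn`, `baseline_kernel` and `median_seconds` are names made up for illustration, and `clock()` measures CPU time with fairly coarse resolution, so the array needs to be large enough for each run to take a measurable amount of time.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Signature shared by the loop variants being compared (hypothetical). */
typedef void (*kernel_fn)(int *a, size_t n, int vara, int varb);

/* In-bounds version of the loop from the question, used as the baseline. */
void baseline_kernel(int *a, size_t n, int vara, int varb)
{
    for (size_t i = 0; i + 1 < n; i++)
        a[i] = a[i + 1] * vara + varb;
}

/* Comparison callback for qsort over doubles. */
static int cmp_double(const void *x, const void *y)
{
    double dx = *(const double *)x, dy = *(const double *)y;
    return (dx > dy) - (dx < dy);
}

/* Runs the kernel `runs` times and returns the median CPU time in seconds.
   Returns -1.0 if runs is not positive or the sample buffer cannot be allocated. */
double median_seconds(kernel_fn kernel, int *a, size_t n,
                      int vara, int varb, int runs)
{
    if (runs <= 0)
        return -1.0;

    double *samples = malloc((size_t)runs * sizeof *samples);
    if (samples == NULL)
        return -1.0;

    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        kernel(a, n, vara, varb);
        clock_t stop = clock();
        samples[r] = (double)(stop - start) / CLOCKS_PER_SEC;
    }

    /* The median is less sensitive to outliers than the mean. */
    qsort(samples, (size_t)runs, sizeof *samples, cmp_double);
    double median = (runs % 2 != 0)
        ? samples[runs / 2]
        : 0.5 * (samples[runs / 2 - 1] + samples[runs / 2]);

    free(samples);
    return median;
}
```

Because the kernel updates the array in place on every run, pick inputs that cannot overflow an int over repeated runs (for example vara = 1 and varb = 0), and compare the medians of the plain and unrolled variants on the same data.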

If I had to hazard a guess at how to improve the speed of this: remove the dependency on the next element in the array. The loop then turns into a basic vector multiply-accumulate, which is trivial to vectorize.
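A minimal sketch of that idea, assuming the result is allowed to go into a separate output array (the name `scale_add` and the `in`/`out` parameters are hypothetical): every iteration reads only in[i] and writes only out[i], and `restrict` (C99) promises the compiler that the two arrays do not overlap.

```c
#include <stddef.h>

/* out[i] = in[i] * vara + varb.
   No iteration depends on any other, so the compiler is free to vectorize. */
void scale_add(int *restrict out, const int *restrict in, size_t n,
               int vara, int varb)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * vara + varb;
}
```

Whether the compiler actually vectorizes this depends on the target and the optimization level, so it is still worth inspecting the generated code.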

nimish
  • I was about to suggest writing the output to a different array. That's essentially what you are saying in your last point. Also worth noting that the original code accesses the array out of bounds on the last iteration. – paddy Jun 08 '18 at 03:46
  • Yes, there's potentially undefined behavior going on here too! All the more reason to inspect the generated code... – nimish Jun 08 '18 at 03:48
  • This is not my exact code (I'm just trying to learn the technique rather than just get an answer). In my real code there is no risk of the array going out of bounds. – Nikhil Srikumar Jun 08 '18 at 03:52
  • Whether or not loop unrolling actually gives a speedup depends entirely on the specifics of the code. The general technique is essentially as you have already done. If your question doesn't really have an answer I'm not sure it's a great fit for a Q&A site! – nimish Jun 08 '18 at 03:56
  • I was wondering if I was doing something wrong with the general technique. If not, I will look at some other ways to optimize my code. – Nikhil Srikumar Jun 08 '18 at 04:04