Loop Unrolling, Performance Lab

Question

I'm trying to optimize this code by loop unrolling,

void naive_flip(int dim, pixel *src, pixel *dst) 
{
    int i, j;
    for (i = 0; i < dim; i++){
        for (j = 0; j < dim; j++){
            dst[RIDX_F(i, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i, j, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i, j, dim)].blue  = src[RIDX(i, j, dim)].blue;
        }
    }
}

However, I haven't really done it before, so when I tried it, I got this

void flip_one(int dim, pixel *src, pixel *dst)
{
    //i will be attempting loop unrolling to optimize code
    int i, j;
    for (i=0; i<dim; i+=32)
    {
        for (int j=0; j<dim; j+=32)
        {
            dst[RIDX_F(i, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i, j+1, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i, j+2, dim)].blue  = src[RIDX(i, j, dim)].blue;   
        }
        for (int j=0; j<dim; j+=32)
        {
            dst[RIDX_F(i+1, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i+1, j+1, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i+1, j+2, dim)].blue  = src[RIDX(i, j, dim)].blue;   
        }
        for (int j=0; j<dim; j+=32)
        {
            dst[RIDX_F(i+2, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i+2, j+1, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i+2, j+2, dim)].blue  = src[RIDX(i, j, dim)].blue;   
        }
        for (int j=0; j<dim; j+=32)
        {
            dst[RIDX_F(i+3, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i+3, j+1, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i+3, j+2, dim)].blue  = src[RIDX(i, j, dim)].blue;   
        }
    }
}

When running the code, it doesn't work, and it give me this error:

"ERROR: Dimension=96, 9216 errors

E.g., The following two pixels should have equal value:

src[9215].{red,green,blue} = {22543,1426,53562}

dst[9120].{red,green,blue} = {0,0,0}"

Any help on what I'm doing wrong or what I should be doing is appreciated

EDIT I updated my code with this

void flip_one(int dim, pixel *src, pixel *dst)
{
    //i will be attempting loop unrolling to optimize code
    int i, j;
    for (i=0; i<dim; i++)
    {
        for (int j=0; j<dim; j++)
        {
            dst[RIDX_F(i, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i, j, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i, j, dim)].blue  = src[RIDX(i, j, dim)].blue;

            dst[RIDX_F(i, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i, j, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i, j, dim)].blue  = src[RIDX(i, j, dim)].blue;

            dst[RIDX_F(i, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i, j, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i, j, dim)].blue  = src[RIDX(i, j, dim)].blue;

            dst[RIDX_F(i, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i, j, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i, j, dim)].blue  = src[RIDX(i, j, dim)].blue;
        }
    }
}

I am no longer getting the error (yay!) but this doesn't actually speed it up, in fact it slows it down. Maybe I did something else wrong, but, I don't know what.

EDIT I updated the code to look like

void flip_one(int dim, pixel *src, pixel *dst)
{
    //i will be attempting loop unrolling to optimize code
    int i, j;
    for (i=0; i<dim; i++)
    {
        for (int j=0; j<dim; j+=4)
        {
            dst[RIDX_F(i, j, dim)].red   = src[RIDX(i, j, dim)].red;
            dst[RIDX_F(i, j, dim)].green = src[RIDX(i, j, dim)].green;
            dst[RIDX_F(i, j, dim)].blue  = src[RIDX(i, j, dim)].blue;

            dst[RIDX_F(i, j+1, dim)].red   = src[RIDX(i, j+1, dim)].red;
            dst[RIDX_F(i, j+1, dim)].green = src[RIDX(i, j+1, dim)].green;
            dst[RIDX_F(i, j+1, dim)].blue  = src[RIDX(i, j+1, dim)].blue;

            dst[RIDX_F(i, j+2, dim)].red   = src[RIDX(i, j+2, dim)].red;
            dst[RIDX_F(i, j+2, dim)].green = src[RIDX(i, j+2, dim)].green;
            dst[RIDX_F(i, j+2, dim)].blue  = src[RIDX(i, j+2, dim)].blue;

            dst[RIDX_F(i, j+3, dim)].red   = src[RIDX(i, j+3, dim)].red;
            dst[RIDX_F(i, j+3, dim)].green = src[RIDX(i, j+3, dim)].green;
            dst[RIDX_F(i, j+3, dim)].blue  = src[RIDX(i, j+3, dim)].blue;
        }
    }
}

That's not even proper unrolling - you're supposed to repeat the statements in the inner loops... Also it is *compiler* that is supposed to do it for you — Antti Haapala -- Слава Україні, Nov 10 '17 at 03:50
@Antii can you explain it to me more. Like I said, I've never really done it before. I though I was repeating the inner loop (while making a small change, via another post I found). Is it suppose to be *exactly* the same. I don't understand what I am doing wrong, thus me asking for help — MJS, Nov 10 '17 at 03:54
No, the *inner loop only*, i.e. you're supposed to have 3 * n consecutive `dst[something] = src[something]` without intervening `for` statements — Antti Haapala -- Слава Україні, Nov 10 '17 at 03:55
Take a look at [Loop unrolling optimization, how does this work](https://stackoverflow.com/questions/10301372/loop-unrolling-optimization-how-does-this-work) and also the wikipedia article. — David C. Rankin, Nov 10 '17 at 04:00
I updated the code, I think I get what you were saying @Antii. The code runs now, however it didn't actually optimize anything, I'm sure I'm doing something wrong here — MJS, Nov 10 '17 at 04:17
A thing about to remember about this lab is cache locality i.e. should loops in your case first loop on `i` then `j` or `j` then `i`? Also, find any invariant in your loops. It looks to be `RIDX_F(i, j, dim)` is evaluated multiple times when really it can be calculated once and saved to a local variable for multiple usages. — Miket25, Nov 10 '17 at 04:23
Have a look at the [example from Wikipedia](https://en.wikipedia.org/wiki/Loop_unrolling#Simple_manual_example_in_C). You have four copies of the body, so the loop should go up in steps of 4 (i.e. `j++` should be `j += 4`). The first copy of the body stays as is. In the second copy, replace `j` with `j+1` (hint: there are 6 places). In the next copy, replace `j` with `j+2`, and so on. — tom, Nov 10 '17 at 04:25
thanks @tom. I got the code to work, it runs better better now. However it is still not "faster" than the original code. any tips how to make it better? — MJS, Nov 10 '17 at 04:36
One simple thing you can try is to increase the step size (which will require adding more copies of the body). Make sure `dim` is a multiple of the step size, otherwise the last iteration will exceed the bounds of the array. Miket25 gave you some more advanced tips. — tom, Nov 10 '17 at 04:57
What is `RIDX_F`? Is it a macro? Can you cache the calls to `RIDX_F(i, j, dim)]` and `RIDX(i, j, dim)`? If this is something that you perform frequently, a slightly different optimization would be to reorder the array of pixels so that the pixels of each dimension are continuous, allowing you to use `memcpy`. — Blender, Nov 10 '17 at 05:47
As for your edit, wouldn't it slow your code down by a factor of 4? You didn't unroll the inner loop, you in fact just do the same copying operation 4 times instead just once. It has the same end result, but you just slowed it down. — Blender, Nov 10 '17 at 06:13
@Miket25 how would I go about taking RIDX_F(i, j, dim) out of the loop and make it one valuable instead of the red blue and green that it currently is? I know what I want it to do, but i dont know how I would do that — MJS, Nov 10 '17 at 06:23
@MJS `RIDX_F(i, j, dim)` will be inside both loops. But instead of calculating it nine times when used as an index value, it should just be calculated once and have that value saved to a local variable so that the other eight usages refer to that variable instead of reevaluation. — Miket25, Nov 10 '17 at 15:36

score 1 · Answer 1 · edited Nov 15 '17 at 22:11

The basic idea of loop unrolling is to explicitly write computations in the loop body many times instead of letting the compiler figure it out based on the loop bounds and conditions. In this way, addresses are known at compile time rather than computing them at runtime in case of rolled loops. Also cost of branching due to check on bounds is reduced. So each loop nest will have minimum unroll-threshold which is a function of its bounds and the nature computation done in loop body, beyond which unrolling will result in a speed-up. Unrolling might not provide speedup in all cases. Compilers like LLVM allow you to specify unrolling factor by using -mllvm -unroll-count=U so that you do not have to manually unroll it. I am sure GCC has an equivalent param. You could write a script to run your loop with different unroll factors to measure speedup and arrive at best unroll count.

Rolled version:

for (x = 0; x < N; x++)
 {
     operation(x);
 }

Unroll count = 2: reduces number of iterations by half, assuming N is even

for (x = 0; x < N; x+=2)
 {
     operation(x);
     operation(x+1);
 }

Unroll count = 4: reduces number of iterations to one-fourth, assuming N is divisible by 4

for (x = 0; x < N; x+=4)
 {
     operation(x);
     operation(x+1);
     operation(x+2);
     operation(x+3);
 }

If the indices are not divisible by unroll count, a residual loop is required to complete the task, which has its own overhead.

Unroll count = 4: reduces number of iterations to one-fourth, when N is not divisible by 4

//main loop
for (x = 0; x <= N-4; x+=4)
 {
     operation(x);
     operation(x+1);
     operation(x+2);
     operation(x+3);
 }
 //residual loop
 for ( ; x < N; x++)
 {
     operation(y);
 }

Another way of dealing with residual computation is using Duff's device which is basically switch based implementation of loop body to make sure last iteration of the loop takes care of the residual computation without having to write a separate loop altogether.

I added Duff's after you suggested. @tom – shrm Nov 11 '17 at 07:21 — shrm, Nov 11 '17 at 07:21

Loop Unrolling, Performance Lab

1 Answers1