Nested Loop Unrolling in C

Question

I want to optimize my code with loop unrolling. I tried to apply it, but I don't think I did it correctly, and I cannot see where the problem is. I want to unroll the outer loop.

This loop transposes a matrix.

This is the loop I want to unroll:

void transpose(int dim, int *src, int *dst) {
    int i, j;

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[j * dim + i] = src[i * dim + j];
}

This is my unrolled version:

void transpose(int dim, int *src, int *dst) {
    int i = 0, j = 0, dimi = 0, dimj = 0, tempi = 0;

    for (i = 0; i < dim; i += 8) {
        for (j = 0; j < dim; j++) {
            dimj = j * dim + i;
            dimi = i * dim + j;
            dst[dimj] = src[dimi];

            tempi = i + 1;
            if (tempi < dim) {
                dimj = j * dim + tempi;
                dimi = tempi * dim + j;
                dst[dimj] = src[dimi];

                tempi += 1;
                if (tempi < dim) {
                    dimj = j * dim + tempi;
                    dimi = tempi * dim + j;
                    dst[dimj] = src[dimi];

                    tempi += 1;
                    if (tempi < dim) {
                        dimj = j * dim + tempi;
                        dimi = tempi * dim + j;
                        dst[dimj] = src[dimi];

                        tempi += 1;
                        if (tempi < dim) {
                            dimj = j * dim + tempi;
                            dimi = tempi * dim + j;
                            dst[dimj] = src[dimi];

                            tempi += 1;
                            if (tempi < dim) {
                                dimj = j * dim + tempi;
                                dimi = tempi * dim + j;
                                dst[dimj] = src[dimi];

                                tempi += 1;
                                if (tempi < dim) {
                                    dimj = j * dim + tempi;
                                    dimi = tempi * dim + j;
                                    dst[dimj] = src[dimi];

                                    tempi += 1;
                                    if (tempi < dim) {
                                        dimj = j * dim + tempi;
                                        dimi = tempi * dim + j;
                                        dst[dimj] = src[dimi];
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

Loop unrolling as an optimization is best left to the compilers. — Chad, Jan 23 '17 at 21:27
Loop unrolling is a job for the compiler, let it do it for you. — rom1v, Jan 23 '17 at 21:27
The compiler can see if this has other side effects, such as a worse cache hit. Are you taking that into account as well? — Jongware, Jan 23 '17 at 21:28
Yeah I know, @Chad I have to optimize myself because I have to use this function in my homework. :( Can you optimize it? — bekirsevki, Jan 23 '17 at 21:33
Yes, I'm taking that into account. I have to optimize it myself because I have to use this function in my homework. @RadLexus — bekirsevki, Jan 23 '17 at 21:34
Okay, fair reason. When you say "I tried to apply unrolling", what makes you think it did not work? That part is missing from your question. — Jongware, Jan 23 '17 at 21:36
Each optimization is supposed to increase our speedup score, but when I implemented the version above, the score did not increase; it actually decreased. So I think I did the unrolling wrong, meaning my implementation must be incorrect. Can you see any error in my implementation above? @RadLexus — bekirsevki, Jan 23 '17 at 21:43

score 2 · Accepted Answer · edited Jan 23 '17 at 22:47

I'm not sure what the error in your current code is, but here is another approach.

void transpose(int dim, int *src, int *dst) {
    int i, j;

    for (i = 0; i <= dim-8; i += 8)
    {
        for (j = 0; j < dim; j++)
        {
            dst[j * dim + (i+0)] = src[(i+0) * dim + j];
            dst[j * dim + (i+1)] = src[(i+1) * dim + j];
            dst[j * dim + (i+2)] = src[(i+2) * dim + j];
            dst[j * dim + (i+3)] = src[(i+3) * dim + j];
            dst[j * dim + (i+4)] = src[(i+4) * dim + j];
            dst[j * dim + (i+5)] = src[(i+5) * dim + j];
            dst[j * dim + (i+6)] = src[(i+6) * dim + j];
            dst[j * dim + (i+7)] = src[(i+7) * dim + j];
        }
    }

    // Use the normal loop for any remaining elements   
    for (; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[j * dim + i] = src[i * dim + j];
}

Note: the number of multiplications can be reduced by introducing a variable such as:

int jdim = j * dim + i;
dst[jdim + 0] = ...
dst[jdim + 1] = ...
...
dst[jdim + 7] = ...

and likewise for the RHS.
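
Spelled out in full, the hoisting described above might look like the sketch below. The name transpose_hoisted and the use of two base pointers instead of a single jdim index are assumptions made for illustration, not part of the answer. On the right-hand side the eight reads are dim elements apart, so the offsets there are multiples of dim.

```c
/* Sketch: 8-way unrolled transpose with the index arithmetic hoisted out
 * of the eight assignments. transpose_hoisted is a hypothetical name. */
void transpose_hoisted(int dim, const int *src, int *dst) {
    int i, j;

    /* Full blocks of 8 source rows. */
    for (i = 0; i + 8 <= dim; i += 8) {
        for (j = 0; j < dim; j++) {
            int *d = dst + j * dim + i;        /* points at dst[j * dim + i] */
            const int *s = src + i * dim + j;  /* points at src[i * dim + j] */

            d[0] = s[0 * dim];   /* writes are adjacent, reads are dim apart */
            d[1] = s[1 * dim];
            d[2] = s[2 * dim];
            d[3] = s[3 * dim];
            d[4] = s[4 * dim];
            d[5] = s[5 * dim];
            d[6] = s[6 * dim];
            d[7] = s[7 * dim];
        }
    }

    /* Remaining rows (fewer than 8) use the plain loop. */
    for (; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[j * dim + i] = src[i * dim + j];
}
```

An optimizing compiler may already perform this hoisting on the version above, so whether it helps has to be measured.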

@SevkiBekir: this code might be faster just because of the order of reads and writes. Try swapping the `i` and the `j` loop in the naive function and benchmark that too. — chqrlie, Jan 23 '17 at 22:52

chqrlie · Answer 2 · 2017-01-23T22:01:06.203

The whole purpose of unrolling loops is to remove tests. You make no assumptions about the value of dim, so your unrolled version has to keep all the tests. I doubt you will see any improvement with the unrolled code, but only careful benchmarking can tell you whether it makes a difference for a given compiler and architecture.

One thing for sure: it made the code much more difficult to read and much easier to mess up.
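
The benchmarking mentioned above could be set up roughly as follows. This is only a sketch: time_transpose, transpose_fn, transpose_naive and transpose_swapped are hypothetical names, the function-pointer type assumes the (int, int *, int *) signature from the question, and clock() measures CPU time at a fairly coarse resolution, so large matrices and many repetitions are needed for meaningful numbers.

```c
#include <stdlib.h>
#include <time.h>

/* Signature used by the functions in the question. */
typedef void (*transpose_fn)(int dim, int *src, int *dst);

/* Loop order from the question: sequential reads, strided writes. */
void transpose_naive(int dim, int *src, int *dst) {
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[j * dim + i] = src[i * dim + j];
}

/* Same work with the loops swapped: strided reads, sequential writes. */
void transpose_swapped(int dim, int *src, int *dst) {
    for (int j = 0; j < dim; j++)
        for (int i = 0; i < dim; i++)
            dst[j * dim + i] = src[i * dim + j];
}

/* Average CPU seconds per call of fn on a dim x dim matrix,
 * or -1.0 for bad arguments or a failed allocation. */
double time_transpose(transpose_fn fn, int dim, int reps) {
    if (dim <= 0 || reps <= 0)
        return -1.0;

    size_t n = (size_t)dim * (size_t)dim;
    int *src = malloc(n * sizeof *src);
    int *dst = malloc(n * sizeof *dst);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    for (size_t k = 0; k < n; k++)
        src[k] = (int)k;

    /* Reading the result into a volatile discourages the compiler
     * from discarding the calls as dead code. */
    volatile int sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        fn(dim, src, dst);
        sink = dst[n - 1];
    }
    clock_t end = clock();
    (void)sink;

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Comparing, for example, time_transpose(transpose_naive, 1024, 50) with the same call for an unrolled candidate only makes sense when both are built with the same compiler and optimization flags.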

If you know the most common values for dim, you can try to optimize those. For example, if you know the most common case is 3x3 matrices, you could write this:

void transpose(int dim, const int *src, int *dst) {
    if (dim == 3) {
        dst[0 * 3 + 0] = src[0 * 3 + 0];
        dst[0 * 3 + 1] = src[1 * 3 + 0];
        dst[0 * 3 + 2] = src[2 * 3 + 0];
        dst[1 * 3 + 0] = src[0 * 3 + 1];
        dst[1 * 3 + 1] = src[1 * 3 + 1];
        dst[1 * 3 + 2] = src[2 * 3 + 1];
        dst[2 * 3 + 0] = src[0 * 3 + 2];
        dst[2 * 3 + 1] = src[1 * 3 + 2];
        dst[2 * 3 + 2] = src[2 * 3 + 2];
    } else {
        for (int i = 0; i < dim; i++) {
            for (int j = 0; j < dim; j++) {
                dst[j * dim + i] = src[i * dim + j];
            }
        }
    }
}

Modern compilers are good at optimizing the simple original code, taking advantage of hardware-specific capabilities such as vectorization. Unless you know exactly what to optimize and when, they will do a much better job than you could, without the risk of introducing bugs.

Not just the compiler. The processors often have special instructions which make loops faster than the equivalent code written linearly. — Malcolm McLean, Jan 23 '17 at 21:58

score 0 · Answer 3 · answered Jan 23 '17 at 22:51

Here is an example of an unrolled loop. Notice that the goal is to remove conditional statements and dependencies on variables. Also, this code has not been tested.

void transpose(int dim, int *src, int *dst) {
    // represent where the data is being read and where it is going
    int dstIndex = 0;
    int srcIndex = 0;

    // precalculate constants used within the loop
    int total = dim*dim;
    int unrolled = dim / 4;

    int dimx0 = dim*0;
    int dimx1 = dim*1;
    int dimx2 = dim*2;
    int dimx3 = dim*3;
    int dimx4 = dim*4;

    int i = 0;
    int j = 0;

    // since the matrix is being transposed i,j order doesn't matter as much
    // because one of the matrices will be accessed by column and the other
    // will be accessed by row (more efficient)
    for (j = 0; j < dim; j++) {
        for (i = 0; i < unrolled; i++) {
            // here the loop is being unrolled
            // notice that each statement does not rely on previous statements
            // and there is no conditional code
            dst[dstIndex + 0] = src[srcIndex + dimx0];
            dst[dstIndex + 1] = src[srcIndex + dimx1];
            dst[dstIndex + 2] = src[srcIndex + dimx2];
            dst[dstIndex + 3] = src[srcIndex + dimx3];
            dstIndex += 4;
            srcIndex += dimx4;
        }

        // the transpose was previously completed in larger blocks of 4
        // here whatever indices were not transposed will be taken care of
        // e.g. if the matrix was 13x13, the above loop would run 3 times per row
        // and this loop would run once per row
        for (i = unrolled * 4; i < dim; i++) {
            dst[dstIndex] = src[srcIndex];
            dstIndex += 1;
            srcIndex += dim;
        }

        // increment the source index
        srcIndex %= total;
        srcIndex += 1;
    }
}
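
Because the answer says its code has not been tested, a small checker like the one below may help before benchmarking anything. It is only a sketch: check_transpose, first_failing_dim and transpose_reference are hypothetical names, and the function-pointer type assumes the (int, int *, int *) signature used in the question. Sizes that are not a multiple of the unroll factor are the ones most likely to expose mistakes in the remainder handling.

```c
#include <stdlib.h>

/* Signature used by the functions in the question. */
typedef void (*transpose_fn)(int dim, int *src, int *dst);

/* Plain reference version, used here only to demonstrate the checker. */
void transpose_reference(int dim, int *src, int *dst) {
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[j * dim + i] = src[i * dim + j];
}

/* Returns 1 if fn transposes a dim x dim matrix correctly, 0 if it does
 * not, and -1 for a negative dim or a failed allocation. */
int check_transpose(transpose_fn fn, int dim) {
    if (dim < 0)
        return -1;

    size_t n = (size_t)dim * (size_t)dim;
    size_t alloc = n ? n : 1;            /* avoid a zero-sized allocation */
    int *src = malloc(alloc * sizeof *src);
    int *dst = malloc(alloc * sizeof *dst);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1;
    }

    for (size_t k = 0; k < n; k++) {
        src[k] = (int)k + 1;             /* distinct positive values */
        dst[k] = -1;                     /* sentinel: cells never written stay -1 */
    }

    fn(dim, src, dst);

    int ok = 1;
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            if (dst[j * dim + i] != src[i * dim + j])
                ok = 0;

    free(src);
    free(dst);
    return ok;
}

/* Tries every size from 0 to max_dim and returns the first size for which
 * the check does not pass, or -1 if all of them pass. */
int first_failing_dim(transpose_fn fn, int max_dim) {
    for (int dim = 0; dim <= max_dim; dim++)
        if (check_transpose(fn, dim) != 1)
            return dim;
    return -1;
}
```

Calling first_failing_dim(transpose, 20) on a candidate covers sizes below, equal to, and above the unroll factor. Note that the checker compares results only; it cannot detect writes outside the destination array, for which a tool such as a memory sanitizer would be needed.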
