loop unrolling for nested for loops in C

Question

I originally have this function, and I am trying to optimize it further using loop unrolling - which I am having a trouble with - flipping the for loops increase the efficiency, as well as getting the calls outside the loops. However, when it comes to applying loop unrolling as I did, it misses up what the function should be doing.

double
applyFilter(class Filter *filter, cs1300bmp *input, cs1300bmp *output)
{

  long long cycStart, cycStop;

  cycStart = rdtscll();

  output -> width = input -> width;
  output -> height = input -> height;


  for(int col = 1; col < (input -> width) - 1; col = col + 1) {
    for(int row = 1; row < (input -> height) - 1 ; row = row + 1) {
      for(int plane = 0; plane < 3; plane++) {

    output -> color[plane][row][col] = 0;

    for (int j = 0; j < filter -> getSize(); j++) {
      for (int i = 0; i < filter -> getSize(); i++) {   
        output -> color[plane][row][col]
          = output -> color[plane][row][col]
          + (input -> color[plane][row + i - 1][col + j - 1] 
         * filter -> get(i, j) );
      }
    }
    
    output -> color[plane][row][col] =  
      output -> color[plane][row][col] / filter -> getDivisor();

    if ( output -> color[plane][row][col]  < 0 ) {
      output -> color[plane][row][col] = 0;
    }

    if ( output -> color[plane][row][col]  > 255 ) { 
      output -> color[plane][row][col] = 255;
    }
      }
    }
  }

  cycStop = rdtscll();
  double diff = cycStop - cycStart;
  double diffPerPixel = diff / (output -> width * output -> height);
  fprintf(stderr, "Took %f cycles to process, or %f cycles per pixel\n",
      diff, diff / (output -> width * output -> height));
  return diffPerPixel;
}

This is where I arrive at, but it doesn't seem to be working. I would appreciate an explanation of what I am doing wrong in the loop unrolling part.

double applyFilter(class Filter *filter, cs1300bmp *input, cs1300bmp *output){

  long long cycStart, cycStop;

  cycStart = rdtscll();

//start
    
  output -> width = input -> width;
  output -> height = input -> height;
    
//function calls outside loop.
    int filterSize = filter -> getSize();
    int divisor = filter -> getDivisor();
    
//intializaions
    
    int inputHlen = input -> height - 1;
    int inputWlen = input -> width - 1;
    

// loop unrolling row + k - 1 , col + k - 1
for(int plane = 0; plane < 3; plane++){
  for(int row = 1; row + 3 < inputHlen; row += 4){
    for(int col = 1; col +3 < inputWlen; col += 4){ 
      
              
          output -> color[plane][row][col] = 0;
          output -> color[plane][row + 1][col + 1] = 0;
          output -> color[plane][row + 2][col + 2] = 0;
          output -> color[plane][row + 3][col + 3] = 0;
          
          int acc1 = output -> color[plane][row][col];
          int acc2 = output -> color[plane][row + 1][col + 1];
          int acc3 = output -> color[plane][row + 2][col + 2];
          int acc4 = output -> color[plane][row + 3][col + 3];
          
          for (int j = 0; j + 3 < filterSize; j += 4) {
              for (int i = 0; i + 3 < filterSize; i += 4){ 
                acc1 = acc1 + (input -> color[plane][row + i - 1][col + j - 1] * filter -> get(i, j) );
                acc2 = acc2 + (input -> color[plane][row + i][col + j]         * filter -> get(i + 1, j + 1) );
                acc3 = acc3 + (input -> color[plane][row + i + 1][col + j + 1] * filter -> get(i +2, j + 2) );
                acc4 = acc4 + (input -> color[plane][row + i + 2][col + j + 2] * filter -> get(i + 3, j + 3) );
              }
            }
          
          acc1 = acc1 / divisor;
          acc2 = acc2 / divisor;
          acc3 = acc3 / divisor;
          acc4 = acc4 / divisor;

          
          acc1 = (acc1 < 0) ? 0 : acc1;
          acc1 = (acc1 > 255) ? 255 : acc1;

          acc2 = (acc1 < 0) ? 0 : acc1;
          acc2 = (acc1 > 255) ? 255 : acc1;
          
          acc3 = (acc1 < 0) ? 0 : acc1;
          acc3 = (acc1 > 255) ? 255 : acc1;
          
          acc4 = (acc1 < 0) ? 0 : acc1;
          acc4 = (acc1 > 255) ? 255 : acc1;
          
          
          output -> color[plane][row][col] = acc1;
          output -> color[plane][row + 1][col + 1] = acc2;
          output -> color[plane][row + 2][col + 2] = acc3;
          output -> color[plane][row + 3][col + 3] = acc4;
          
      }
    }
  }
  
//end

  cycStop = rdtscll();
  double diff = cycStop - cycStart;
  double diffPerPixel = diff / (output -> width * output -> height);
  fprintf(stderr, "Took %f cycles to process, or %f cycles per pixel\n",
      diff, diff / (output -> width * output -> height));
  return diffPerPixel;
}

Well I see a bunch of things that look like copy/paste errors. For example `acc4 = output->color[plane][row + 2][col + 2];`... Then there are all the clamping tests which you just didn't even try to edit. Beyond this, consider moving the `plane` loop to be the outermost loop for performance reasons as it might result in fewer cache misses. If all your row data is individually allocated, it's now time to stop doing that and use a single contiguous block of memory. Then look into SIMD or check the vectorization features of your compiler. — paddy, Jul 07 '20 at 04:19
@paddy thanks for pointing out the typos. I see why making the outermost loop the plane loop, and thanks for pointing that out, it improved the CPE. However, I would appreciate more in depth clarification of implementing loop unrolling - I am a beginner in C and CS. I can not find any examples of implementing loop unrolling in nested for loops. — Kiro Reda, Jul 07 '20 at 04:36
Loop unrolling is not a particularly common or recommended software practice nowadays, except in unique situations. Compilers are plenty smart enough to automatically unroll fixed loops where appropriate, and will usually make a better-informed decision than the programmer. This gives you the freedom to express your ideas more simply with concise, maintainable code that has fewer errors in it. — paddy, Jul 07 '20 at 04:48
In your case, I would consider unrolling to be a very narrow-minded approach to optimizing this filter convolution. You may have better success analyzing the things that will cause extra computation or execution delays. This includes improving memory access patterns, reducing expensive operations (e.g. multiply and divide), reducing/eliminating branching, etc. — paddy, Jul 07 '20 at 04:52

score 0 · Answer 1 · answered Jul 07 '20 at 04:47

One problems is that you want to unroll only the innermost loop (col). By trying to unroll two loops (row and col) you're missing a lot of calculation. One way to see this is that you've cut down your loop iterations by a factor of 16 (the row and col loops both step by 4), while the loop body only does 4 times the work.

This is because you're only looking at the "diagonal" elements:

color[plane][1][1]
color[plane][2][2]
color[plane][3][3]
color[plane][4][4]

for the first loop iteration, rather than

color[plane][1][1]
color[plane][1][2]
color[plane][1][3]
color[plane][1][4]

You're never computing 12 of those values. color[plane][2][1] would be one of them. You should keep row incrementing by 1 every iteration, and should be updating color[plane][row][...] each loop.

loop unrolling for nested for loops in C

1 Answers1