Ineffective "Peel/Remainder" Loop in my code

Question

I have this function:

bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
{         
   bool ret = false;
   // input size (-1 for the safe bilinear interpolation)
   const int width = im.cols-1;
   const int height = im.rows-1;
   // output size
   const int halfWidth  = res.cols >> 1;
   const int halfHeight = res.rows >> 1;
   float *out = res.ptr<float>(0);
   const float *imptr  = im.ptr<float>(0);
   for (int j=-halfHeight; j<=halfHeight; ++j)
   {
      const float rx = ofsx + j * a12;
      const float ry = ofsy + j * a22;
      #pragma omp simd
      for(int i=-halfWidth; i<=halfWidth; ++i, out++)
      {
         float wx = rx + i * a11;
         float wy = ry + i * a21;
         const int x = (int) floor(wx);
         const int y = (int) floor(wy);
         if (x >= 0 && y >= 0 && x < width && y < height)
         {
            // compute weights
            wx -= x; wy -= y;
            int rowOffset = y*im.cols;
            int rowOffset1 = (y+1)*im.cols;
            // bilinear interpolation
            *out =
                (1.0f - wy) * ((1.0f - wx) * imptr[rowOffset+x]   + wx * imptr[rowOffset+x+1]) +
                (       wy) * ((1.0f - wx) * imptr[rowOffset1+x] + wx * imptr[rowOffset1+x+1]);
         } else {
            *out = 0;
            ret =  true; // touching boundary of the input            
         }
      }
   }
   return ret;
}

halfWidth varies a lot from call to call: it can be 9, 84, 20, 95, 111... I'm only trying to optimize this code; I don't understand it in detail.

As you can see, the inner for loop has already been vectorized, but Intel Advisor suggests this:

And this is the Trip Count analysis result:

To my understanding, this means that:

Vector length is 8, so 8 floats can be processed at the same time in each vector iteration. This would mean (if I'm not wrong) that the data is 32-byte aligned (even though, as I explain here, it seems that the compiler thinks the data is not aligned).
On average, 2 iterations are fully vectorized, while 3 iterations run in the remainder loop. The same goes for Min and Max. Otherwise I don't understand what ; means.
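
To make the arithmetic behind those numbers concrete, here is a minimal sketch that splits the inner loop's trip count into full vector iterations and leftover remainder iterations. The names `TripSplit` and `splitTripCount` are hypothetical, and the sketch ignores any peel loop the compiler might add for alignment.

```cpp
#include <cassert>

// Hypothetical helper type: how a trip count splits for a given vector length.
struct TripSplit {
    int tripCount;            // total scalar iterations of the inner loop
    int vectorIterations;     // iterations of the vectorized loop body
    int remainderIterations;  // scalar iterations left for the remainder loop
};

// The inner loop runs i = -halfWidth .. +halfWidth, i.e. 2*halfWidth + 1 times.
TripSplit splitTripCount(int halfWidth, int vectorLength) {
    assert(halfWidth >= 0 && vectorLength > 0);
    const int tripCount = 2 * halfWidth + 1;
    return { tripCount, tripCount / vectorLength, tripCount % vectorLength };
}
```

For example, `splitTripCount(9, 8)` gives a trip count of 19, which is 2 vector iterations plus 3 remainder iterations. Since 2*halfWidth+1 is always odd, the remainder is never zero for a vector length of 8, whatever halfWidth is.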

Now my question is: how can I follow Intel Advisor's first suggestion? It says to "increase the size of objects and add iterations so the trip count is a multiple of vector length". OK, so it's simply saying "hey man, do this so that halfWidth*2+1 (since i goes from -halfWidth to +halfWidth) is a multiple of 8". But how can I do this? If I add arbitrary iterations, this would obviously break the algorithm!

The only solution that came to my mind is to add "fake" iterations like this:

const int vectorLength = 8;
const int iterations = halfWidth*2+1;
const int remainder = iterations%vectorLength;

for(int i=0; i<iterations+vectorLength-remainder; i++){
  //this iteration was not supposed to exist, skip it!
  if(i>halfWidth) 
     continue;
}

Of course this code would not work as written, since the real loop goes from -halfWidth to halfWidth, but it should give you an idea of my "fake" iterations strategy.
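
One way the "fake iterations" idea could look is sketched below. It rests on assumptions: the output row can be stored in a buffer whose length is rounded up to a multiple of the vector length, and the padding lanes are simply ignored afterwards. `roundUpToMultiple` and `computePaddedRow` are hypothetical names, and `rx + i * a11` stands in for the real interpolation. The padding lanes are handled with a select rather than `continue`, so every lane executes the same instructions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: rounds n up to the next multiple of vectorLength.
inline int roundUpToMultiple(int n, int vectorLength) {
    assert(n >= 0 && vectorLength > 0);
    return (n + vectorLength - 1) / vectorLength * vectorLength;
}

// Fills one padded output row. Only the first 2*halfWidth+1 entries are
// meaningful; the rest are padding written as 0 and meant to be ignored.
std::vector<float> computePaddedRow(int halfWidth, float rx, float a11,
                                    int vectorLength = 8) {
    const int tripCount = 2 * halfWidth + 1;
    const int padded = roundUpToMultiple(tripCount, vectorLength);
    std::vector<float> row(padded);
    float *out = row.data();
    #pragma omp simd
    for (int k = 0; k < padded; ++k) {
        const int i = k - halfWidth;        // original index: -halfWidth..+halfWidth
        const float value = rx + i * a11;   // stand-in for the interpolation
        // Select instead of branching out of the loop body.
        out[k] = (k < tripCount) ? value : 0.0f;
    }
    return row;
}
```

In the original function `out` walks contiguously through `res`, so using this approach there would also require `res` to have a padded row stride, which changes the layout the caller sees. Whether the extra lanes pay off depends on how much time the remainder loop actually takes.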

About the second option ("Increase the size of static and automatic objects, and use a compiler option to add data padding") I have no idea how to implement this.

"This would mean (if I'm not wrong) that data are 32 bytes aligned" - No, there is also an unaligned load operation nowadays. You'd have to target a new enough architecture, though, it's certainly not in SSE2. — MSalters, May 04 '17 at 11:02
"The only solution that came to my mind is to add "fake" iterations like this: `if(i>halfWidth) continue;`" You _did_ notice the `#pragma omp simd`? As in **Single** Instruction Multiple Data? Because you're proposing a MIMD solution there. For SIMD, the data can depend on `[i]`, but the instructions can't. — MSalters, May 04 '17 at 11:05
@MSalters I'm sorry, but I'm using an AVX2 machine, which means that registers are 256 bits = 32 bytes = 8 floats. Am I wrong somewhere? — justHelloWorld, May 04 '17 at 11:05
@MSalters I personally added the `simd`, but I don't understand your comment. I'm simply saying: if we force the number of `for` iterations to be a multiple of 8, there will be no remainder loop, and this would be more efficient because it fits the register perfectly. What am I missing? — justHelloWorld, May 04 '17 at 11:09
You have one register with 8 floats. You can't `continue` that loop for half a register. Almost every AVX2 instruction works on the whole 8 floats. — MSalters, May 04 '17 at 11:11
Ok, but then the solution is simple: instead of `if(i>halfWidth) continue;` we do `if(i<=halfWidth) /**do something**/`, which is the opposite condition of `if(i>halfWidth) /**do nothing**/`. — justHelloWorld, May 04 '17 at 11:13
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/143394/discussion-between-msalters-and-justhelloworld). — MSalters, May 04 '17 at 11:14
@MSalters could you please give a look at [this](http://stackoverflow.com/questions/43844396/how-should-i-interpreter-these-vtune-results) question? — justHelloWorld, May 08 '17 at 09:43
Looks like you either need a consultant on-site, or just beefier hardware. I know from your posts that you've done quite a bit yourself, beyond what most programmers could do. But there's a point at which you should call in the experts, or admit that that's not worth the money. — MSalters, May 08 '17 at 10:32
@MSalters thanks for your "beyond what most programmers could do", I appreciate it — justHelloWorld, May 08 '17 at 10:34

score 1 · Answer 1 · answered May 21 '20 at 10:44

First, you have to check the Advisor vector Efficiency metric, as well as the relative time spent in the loop remainder compared to the loop body (see the hotspots list in Advisor). If efficiency is close to 100% (or the time spent in the remainder is very small), then it is not worth the effort (or the money, as MSalters mentioned in the comments).

If it is much lower than 100% (and the tool reports no other penalties), then you can either refactor the code to "add fake iterations" (few users can afford to do that), or try #pragma loop_count with the most typical iteration counts (depending on the typical halfWidth values).

If halfWidth is totally random (no common or average values), then there is nothing you can really do about this issue.
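
As a rough illustration of the `#pragma loop_count` route, the sketch below annotates a stand-in loop. `scaleRow` is a hypothetical function, not part of the question's code, and the min/max/avg values are derived only from the sample halfWidth values quoted in the question (9, 84, 20, 95, 111), so treat them as placeholders rather than measurements. The pragma is specific to the Intel compiler; other compilers are expected to ignore it, possibly with a warning.

```cpp
#include <cassert>

// Stand-in for the inner loop: scales one row of 2*halfWidth+1 floats in place.
void scaleRow(float *row, int halfWidth, float factor) {
    assert(row != nullptr && halfWidth >= 0);
    const int tripCount = 2 * halfWidth + 1;
    // Hint for the Intel compiler about typical trip counts.
    // 19 and 223 correspond to halfWidth 9 and 111; 129 is the rounded
    // average over the five sample values. All three are illustrative.
    #pragma loop_count min(19), max(223), avg(129)
    for (int k = 0; k < tripCount; ++k) {
        row[k] *= factor;
    }
}
```

The hint does not remove the remainder loop. It only gives the compiler more information for deciding how to generate the vector body and the remainder, so the effect should be checked again in Advisor.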
