Why does this vectorization fail on AVX-512 but not on AVX2?

Question

I have this code, which I tested on my AVX2 machine:

bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
{         
   bool ret = false;
   // input size (-1 for the safe bilinear interpolation)
   const int width = im.cols-1;
   const int height = im.rows-1;
   // output size
   const int halfWidth  = res.cols >> 1;
   const int halfHeight = res.rows >> 1;
   float *out = res.ptr<float>(0);
   const float *imptr  = im.ptr<float>(0);
   for (int j=-halfHeight; j<=halfHeight; ++j)
   {
      const float rx = ofsx + j * a12;
      const float ry = ofsy + j * a22;
      #pragma omp simd
      for(int i=-halfWidth; i<=halfWidth; ++i, out++)
      {
         float wx = rx + i * a11;
         float wy = ry + i * a21;
         const int x = (int) floor(wx);
         const int y = (int) floor(wy);
         if (x >= 0 && y >= 0 && x < width && y < height)
         {
            // compute weights
            wx -= x; wy -= y;
            int rowOffset = y*im.cols;
            int rowOffset1 = (y+1)*im.cols;
            // bilinear interpolation
            *out =
                (1.0f - wy) * ((1.0f - wx) * imptr[rowOffset+x]   + wx * imptr[rowOffset+x+1]) +
                (       wy) * ((1.0f - wx) * imptr[rowOffset1+x] + wx * imptr[rowOffset1+x+1]);
         } else {
            *out = 0;
            ret =  true; // touching boundary of the input            
         }
      }
   }
   return ret;
}

As suggested by Intel Advisor, I added #pragma omp simd to force vectorization, since the compiler (icpc 2017 update 3) assumed a nonexistent dependency. On my AVX2 machine this doesn't produce any error and actually improves performance.

However, on the AVX-512 machine (with the same compiler and version) this generates a segmentation fault. Why does this happen?

The compilation flags are the same, except that one build uses -xCORE-AVX2 and the other -xMIC-AVX512. This is the complete set of compilation flags:

INTEL_OPT=-O3 -ipo -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2 -fma -align -finline-functions
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl
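
For reference, the two targets differ in vector width: a 256-bit register holds 8 floats and a 512-bit register holds 16. The inner loop runs 2*halfWidth+1 times, which is always odd, so neither width divides the trip count evenly and a remainder always has to be handled. The sketch below only does that arithmetic; `Split` and `splitTripCount` are hypothetical names, and it assumes the compiler uses the full register width, which is not guaranteed.

```cpp
// Hypothetical helper: how the inner loop's trip count splits into
// full vectors plus leftover iterations for a given number of float lanes.
struct Split {
    int fullVectors;  // iterations covered by complete vectors, in vectors
    int remainder;    // leftover iterations for a remainder or masked step
};

inline Split splitTripCount(int halfWidth, int lanes)
{
    // i runs from -halfWidth to +halfWidth inclusive
    const int trip = 2 * halfWidth + 1;
    Split s;
    s.fullVectors = trip / lanes;
    s.remainder   = trip % lanes;
    return s;
}
```

With a halfWidth of 20 (41 iterations), 8 lanes give 5 full vectors and 1 leftover iteration, while 16 lanes give 2 full vectors and 9 leftover iterations, so the 512-bit build spends a much larger share of each row in the remainder path.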

That code looks oddly familiar ;) Could the Segmentation Fault be on the last few elements? AVX-512 obviously uses 256 bits more than AVX-256, and if your input size is an odd multiple of 256 bits (32 bytes) then you might read past the end of a page with 512 bits at a time. — MSalters, May 19 '17 at 12:13
@MSalters Thanks for your comment, I'm glad that you remember, lol. I don't understand what you mean by "input size", but if you mean the number of iterations of the vectorized `for`, it varies a lot. Some `halfWidth` examples are 20, 9, 17... while `halfHeight` values are 48, 20, 9, 43... If you mean the `im` size, some examples of `width` are 1368, 683, 62... while `height` values are 50, 1061, 58... — justHelloWorld, May 19 '17 at 13:03
@MSalters What I mean is that if it were a problem with multiples, it should have shown up with these values on AVX2 too, don't you think? Besides, I thought that `omp simd` took care of creating the remainder and peel loops. — justHelloWorld, May 19 '17 at 13:04
@MSalters Note that the values I reported do not correspond to each other; they are just random numbers I saw in the printed output. — justHelloWorld, May 19 '17 at 13:05
No experience with AVX-512 I'm afraid, and that seems to be the crucial difference here. — MSalters, May 20 '17 at 18:16
Where does the segmentation fault happen? (Compile it with `-g` and run under `gdb`). Also you can try to analyze it with `valgrind`. — Ilya Verbin, May 21 '17 at 16:44
In ` *out = (1.0f - wy) * ((1.0f - wx) * imptr[rowOffset+x] + wx * imptr[rowOffset+x+1]) + ( wy) * ((1.0f - wx) * imptr[rowOffset1+x] + wx * imptr[rowOffset1+x+1]);` — justHelloWorld, May 21 '17 at 18:01
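
The faulting statement is the one that performs the four indexed loads from `imptr`. In scalar code those loads are protected by the `if`, but a vectorized version has to keep them masked for lanes whose coordinates fall outside the image. One way to test whether masking is involved is to make every load in-bounds regardless of the condition and select the result afterwards. The following is a minimal sketch under that assumption, not a confirmed diagnosis; `sampleBilinearClamped` and its parameters are hypothetical, and the image is assumed to be row-major with at least 2 columns and 2 rows.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: bilinear sample at (wx, wy) that never reads outside
// the image, even when the sample itself is rejected.
inline float sampleBilinearClamped(const float *im, int cols, int rows,
                                   float wx, float wy, bool &outside)
{
    const int x = static_cast<int>(std::floor(wx));
    const int y = static_cast<int>(std::floor(wy));

    // Same acceptance test as the original loop (width = cols-1, height = rows-1).
    const bool inside = x >= 0 && y >= 0 && x < cols - 1 && y < rows - 1;

    // Clamp the base index so that the 2x2 neighbourhood is always valid.
    const int cx = std::min(std::max(x, 0), cols - 2);
    const int cy = std::min(std::max(y, 0), rows - 2);

    const float fx = wx - x;   // interpolation weights
    const float fy = wy - y;

    const float *r0 = im + cy * cols + cx;  // upper row of the neighbourhood
    const float *r1 = r0 + cols;            // lower row

    const float v =
        (1.0f - fy) * ((1.0f - fx) * r0[0] + fx * r0[1]) +
        (       fy) * ((1.0f - fx) * r1[0] + fx * r1[1]);

    outside = !inside;
    return inside ? v : 0.0f;  // rejected samples yield 0, as in the original
}
```

If a build using this form no longer crashes on the AVX-512 machine, that would point at the loads for rejected lanes; if it still crashes, the cause lies elsewhere.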

0 Answers