20

I am doing some image processing that benefits from vectorization. I have a function that vectorizes fine, but I cannot convince the compiler that the input and output buffers do not overlap, so that no alias check is necessary. I should be able to do this with __restrict__, but if the buffers are not already declared __restrict__ when they arrive as function arguments, there seems to be no way to tell the compiler that I am absolutely sure the two buffers never overlap.

This is the function:

__attribute__((optimize("tree-vectorize","tree-vectorizer-verbose=6")))
void threshold(const cv::Mat& inputRoi, cv::Mat& outputRoi, const unsigned char valueTh) {

    const int height = inputRoi.rows;
    const int width = inputRoi.cols;

    for (int j = 0; j < height; j++) {
        const uint8_t* __restrict in = (const uint8_t* __restrict) inputRoi.ptr(j);
        uint8_t* __restrict out = (uint8_t* __restrict) outputRoi.ptr(j);
        for (int i = 0; i < width; i++) {
           out[i] = (in[i] < valueTh) ? 255 : 0;
        }
    }
}

The only way I have found to stop the compiler from performing the alias check is to move the inner loop into a separate function whose pointer arguments are declared __restrict__. If I declare that inner function as inline, the alias check comes back.
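A minimal sketch of that workaround, with plain pointers and a byte stride standing in for cv::Mat so that it compiles without OpenCV. The names thresholdRow and thresholdRows are hypothetical, and the noinline attribute is an assumption about how to keep the helper out of line with gcc:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: processes one row. The no-overlap promise is made
// on the function parameters, which is where gcc honoured it in my tests.
// noinline is an assumption: it keeps the helper from being merged back
// into the caller, where the __restrict__ information was lost.
__attribute__((noinline))
void thresholdRow(const uint8_t* __restrict__ in, uint8_t* __restrict__ out,
                  int width, uint8_t valueTh) {
    assert(width >= 0);
    for (int i = 0; i < width; i++) {
        out[i] = (in[i] < valueTh) ? 255 : 0;
    }
}

// Stand-in for the cv::Mat version: raw buffers plus a row stride in bytes.
void thresholdRows(const uint8_t* input, uint8_t* output,
                   int width, int height, int stride, uint8_t valueTh) {
    for (int j = 0; j < height; j++) {
        thresholdRow(input + j * stride, output + j * stride, width, valueTh);
    }
}
```

The cost of this variant is one real function call per image row, which is part of why it feels like a workaround rather than a solution.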

You can see the effect also with this example, which I think is consistent: http://goo.gl/7HK5p7

(Note: I know there might be better ways of writing the same function, but in this case I am just trying to understand how to avoid alias check)

Edit:
Problem is solved!! (See answer below)
Using gcc 4.9.2, here is the complete example. Note the use of the compiler flag -fopt-info-vec-optimized in place of the superseded -ftree-vectorizer-verbose=N.
So, for gcc, use #pragma GCC ivdep and enjoy! :)

Antonio
  • Note that the inlining issue may get fixed for gcc-5: https://gcc.gnu.org/ml/gcc-patches/2014-09/msg00606.html – Marc Glisse Sep 19 '14 at 13:08
  • Thanks for showing the c++ web compiler – StarShine Oct 03 '14 at 11:32
  • I do not have a copy of OpenCV ready to test, but perhaps you can convince the compiler that `inputRoi` and `outputRoi` refer to different buffers with an `__assume(in != out)` statement? There is a lot you can do with `__assume`, but whether the compiler is smart enough to make use of it depends heavily on the case. – Stefan Oct 21 '14 at 21:16
  • @Stefan Assuming `in != out` is definitely not enough information for the compiler: the buffers might _partially_ overlap – Antonio Oct 22 '14 at 07:56
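To illustrate the last comment: two pointers can differ and still address overlapping ranges. The sketch below is only an illustration; rangesOverlap is a hypothetical helper, roughly the kind of range test a vectorizer has to perform at run time when it cannot prove that the buffers are disjoint.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical helper: true if [a, a+n) and [b, b+n) share at least one byte.
// std::less gives a well-defined ordering even for unrelated pointers.
bool rangesOverlap(const uint8_t* a, const uint8_t* b, int n) {
    assert(n >= 0);
    std::less<const uint8_t*> before;
    return before(a, b + n) && before(b, a + n);
}
```

With `in = buf` and `out = buf + 4` and a row width of 8, `in != out` holds but `rangesOverlap(in, out, 8)` is true, so a pointer inequality alone tells the compiler nothing useful.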

3 Answers

5

If you are using the Intel compiler, you can try adding the line:

#pragma ivdep 

The following paragraph is quoted from Intel compiler user manual:

The ivdep pragma instructs the compiler to ignore assumed vector dependencies. To ensure correct code, the compiler treats an assumed dependence as a proven dependence, which prevents vectorization. This pragma overrides that decision. Use this pragma only when you know that the assumed loop dependencies are safe to ignore.

In gcc, one should add the line:

#pragma GCC ivdep

inside the function and right before the loop you want to vectorize (see documentation). This is only supported starting from gcc 4.9 and, by the way, makes the use of __restrict__ redundant.
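A sketch of where the gcc pragma goes, using plain pointers and a byte stride instead of cv::Mat so that it stands alone (the function name and the pointer-based signature are assumptions, not the code from the question). Compilers that do not know the pragma should simply ignore it, possibly with a warning:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pointer-based variant of the question's threshold().
void thresholdIvdep(const uint8_t* input, uint8_t* output,
                    int width, int height, int stride, uint8_t valueTh) {
    assert(width >= 0 && height >= 0);
    for (int j = 0; j < height; j++) {
        const uint8_t* in = input + j * stride;
        uint8_t* out = output + j * stride;
        // Promise to gcc (4.9 or later) that iterations of the next loop
        // carry no dependencies, so no runtime alias check is needed.
#pragma GCC ivdep
        for (int i = 0; i < width; i++) {
            out[i] = (in[i] < valueTh) ? 255 : 0;
        }
    }
}
```

Building with optimization and -fopt-info-vec-optimized, as mentioned in the question's edit, is one way to check whether the inner loop was vectorized. The pragma is a promise, not a check: if the rows do overlap, the result is wrong.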

Antonio
PhD AP EcE
  • This seems a very good hint! I cannot test quickly on http://gcc.godbolt.org because gcc 4.9 compiler is missing, and I understand this feature was not there in previous compiler versions... – Antonio Jan 19 '15 at 10:50
  • Please let me know whether it works or not. This trick works for simple loops about two thirds of the time. If you still have difficulty getting your code to auto-vectorize, assigning the input pointers to dummy pointers inside the function might trick the compiler. – PhD AP EcE Jan 19 '15 at 11:48
  • And please put the #pragma inside the nested loop. #pragma ivdep does not work outside a nested loop most of the time (Intel compiler) – PhD AP EcE Jan 19 '15 at 11:53
  • I tested on MinGW with gcc 4.9.2, it works! Note: I put the pragma directive just before the loop in my case, and the `__restrict__` becomes superfluous; I am updating the answer accordingly. – Antonio Jan 19 '15 at 12:18
2

Another approach for this specific issue, standardised and fully portable across (reasonably modern) compilers, is to use the OpenMP simd directive, which has been part of the OpenMP standard since version 4.0. The code then becomes:

void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd
    for (int i = 0; i < width; i++) {
        outputRoi[i] = (inputRoi[i] < valueTh) ? 255 : 0;
    }
}

When compiled with OpenMP support enabled (either full support or SIMD-only support, as with -qopenmp-simd for the Intel compiler), the code is fully vectorised.

In addition, this lets you state the alignment of the vectors, which can come in handy in some circumstances. For example, had your input and output arrays been allocated with an alignment-aware memory allocator, such as posix_memalign() with an alignment requirement of 256 bits (32 bytes), the code could become:

void threshold(const unsigned char* inputRoi, const unsigned char valueTh,
               unsigned char* outputRoi, const int width,
               const int stride, const int height) {
    #pragma omp simd aligned(inputRoi, outputRoi : 32)
    for (int i = 0; i < width; i++) {
        outputRoi[i] = (inputRoi[i] < valueTh) ? 255 : 0;
    }
}

This should allow the compiler to generate an even faster binary. This feature is not readily available with the ivdep directives, which is all the more reason to use the OpenMP simd directive.
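As a usage sketch for the aligned variant: the buffers have to actually satisfy the alignment that the clause promises. The helper allocRow below is hypothetical, and the example assumes a POSIX system for posix_memalign(); without OpenMP enabled the pragma is ignored and the loop still computes the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free (POSIX)

// Hypothetical helper: allocates `bytes` bytes on a 32-byte boundary,
// returning nullptr on failure. Release the memory with free().
unsigned char* allocRow(std::size_t bytes) {
    void* p = nullptr;
    if (posix_memalign(&p, 32, bytes) != 0) {
        return nullptr;
    }
    return static_cast<unsigned char*>(p);
}

// Single-row threshold; the aligned clause is only valid because callers
// are assumed to pass pointers obtained from allocRow().
void thresholdAligned(const unsigned char* in, unsigned char valueTh,
                      unsigned char* out, int width) {
    assert(reinterpret_cast<std::uintptr_t>(in) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 32 == 0);
    #pragma omp simd aligned(in, out : 32)
    for (int i = 0; i < width; i++) {
        out[i] = (in[i] < valueTh) ? 255 : 0;
    }
}
```

For a multi-row image the same promise has to hold for every row pointer, which in practice means the row stride must also be a multiple of 32 bytes.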

Gilles
1

At least as of version 14, the Intel compiler does not generate aliasing checks for threshold2 in the code you linked, which indicates that your approach should work. The gcc auto-vectorizer, however, misses this optimization opportunity: it does generate vectorized code, but also tests for proper alignment, tests for aliasing, and non-vectorized fall-back/clean-up code.

user1225999
  • The question is sharp enough; things like the possible impact, or memory alignment, are not relevant. I find the relevant part of your answer is the outcome of the Intel compiler, and that could live in a simple comment. – Antonio Nov 07 '14 at 09:22
  • @Antonio: Why is it relevant for you whether gcc generates an alias check or not? – user1225999 Nov 07 '14 at 13:18
  • Because for my use case (image processing) that check will happen several times (at each image row, for an image or subimage that might have a small number of columns) – Antonio Nov 07 '14 at 14:53
  • @Antonio: Fair enough. – user1225999 Nov 07 '14 at 17:24