
I've implemented the following program for matrix convolution:

#include <stdio.h>
#include <time.h>

#define NUM_LOOP 1000
#define N 128   // input or output dimension 1
#define M N     // input or output dimension 2
#define P 5     // convolution matrix dimension 1; for a 3x3 convolution matrix it must be 3
#define Q P     // convolution matrix dimension 2
#define Csize (P*Q)
#define Cdiv  1     // divisor for the filter
#define Coffset 0   // offset

//functions
void unusual(); //unusual implementation of convolution
void naive();
//data
unsigned short int input[N][M] __attribute__((aligned(32)));  // input data
unsigned short int output[N][M] __attribute__((aligned(32))); // output data
unsigned short int kernel[P][Q] __attribute__((aligned(32))); // convolution coefficients

int main(){
    struct timespec tStart, tEnd; // used to record the processing time
    double tTotal, tBest=10000;   // the minimum of the total times will be assigned to the best time

    int w=0;
    do{ // this loop repeats the body to record the best time
        clock_gettime(CLOCK_MONOTONIC,&tStart);

        //function to be executed here :

        unusual();

        clock_gettime(CLOCK_MONOTONIC,&tEnd);
        tTotal = (tEnd.tv_sec - tStart.tv_sec);
        tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;

        if(tTotal<tBest)
            tBest=tTotal;
    } while(w++ < NUM_LOOP);

    printf(" The best time: %lf sec in %d repetitions for a %dx%d matrix\n", tBest, w, N, M);

    return 0;
}

//unusual sequential convolution
void unusual(){
    int i, j,k,temp;

    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;
            for(k=0; k< Csize; k++){
                temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);

            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}
//The naive implementation
inline void naive(){
    int i, j,k,l,temp;
    for (i=P/2; i< N-P/2; i++){
        for(j=Q/2; j< M-Q/2; j++){
            temp=0;

            for(k = 0; k <  P; k++){ 
                for(l = 0; l <  Q; l++){
                    temp += (kernel[k][l]) * (input[i - (P/2)+k][j - (Q/2)+l]);
                }
            }
            output[i][j]=((temp/(Cdiv))+Coffset);
        }
    }
}

The problem is that when I use -O3 for auto-vectorization, it only works for a 3x3 convolution matrix. I've examined the assembly output: auto-vectorization transforms the 3x3 kernel and improves performance considerably (20 times faster; note that the scalar version of the unusual function is slower than the naive function), but there is no improvement for a 5x5 convolution matrix.

UPDATE: I added the naive implementation to the question and, for clarification, changed the picture size to NxM, the conv matrix to kernel, Cdim1xCdim2 to PxQ, and the seqConv function to unusual. The question is not how to improve the implementation of the unusual function. The question is: since all elements are in the same places in memory, why does gcc (with its heuristics, etc.) fail to improve this unusual implementation? NOTE: the problem is not the naive implementation. gcc -O3 improves the naive implementation for 3x3 and 5x5 kernels by a ~7x speedup, and also improves 7x7 and 9x9 by ~1.5x. To improve the convolution I used intrinsics, and the speedup is more than 40x over the naive implementation, which is ~2x faster than the unusual convolution. So my vectorization is ~80x faster than my unusual one. Hand-tuned optimization is not the problem; the auto-vectorizer's optimization, and the reason it fails, is the problem.

GCC command : gcc -Wall -march=native -O3 -o "%e" "%f"

Platform: Linux Mint, Skylake, gcc 6.2

Thanks in advance

Amiri
  • Could you complete this enough so that it compiles? – harold Dec 04 '16 at 23:08
  • Sure, I just added the missing part. The whole program contains many other functions implemented with AVX2 intrinsics. In the program I aligned all matrices using `__attribute__((aligned(32)))` – Amiri Dec 06 '16 at 22:41
  • I compiled with `#define Cdim1 3` in `clang` and `MSVC`, and the speedups over `gcc -O2` are `0.97` and `4.34` respectively (`clang -O3` and `MSVC /O2`). I enabled `/arch:AVX2` and the enhanced instruction set extension, as well as `Ot`, but there are no differences. – Amiri Dec 06 '16 at 23:52
  • No answer for this question? – Amiri Dec 12 '16 at 21:32
  • I couldn't see anything definitive. It's hard to see anything through the huge blob of asm this generates. The only clue I saw was that lots of registers are used for 3x3, 5x5 would take more so possibly for 5x5 GCC gets scared by the amount of stuff it would have to spill .. then again it could be something else, there are a lot of heuristic decisions at that point – harold Dec 12 '16 at 21:36
  • Do you think `gcc` has a 3x3 convolution library and doesn't have one for 5x5? – Amiri Dec 12 '16 at 21:39
  • It might be about the size of the convolution matrix: `3x3x16 < 256` (the vector size in bits), whereas `5x5x16 > 256`. – Amiri Dec 12 '16 at 21:42
  • Having that matrix in a single register doesn't sound super useful, though; it would take some mad shuffles to line up the data with it since it's 2D. But I'll go through that huge mess of asm again and see what it does – harold Dec 12 '16 at 21:47
  • There might be an auto-vectorization algorithm which is adapted to registers smaller than 256 bits – Amiri Dec 12 '16 at 21:50
  • It uses one vector per element of `conv`, so vector size is not the issue. The number of registers still might be, but I'm really not sure. Certainly it cannot use 25 vectors to hold the entries in the 5x5 case, but it could broadcast some on the fly. GCC keeps being uncooperative though; I've spent a lot of time trying to convince it to make nice code for 5x5, but it just doesn't, not even with intrinsics. Well, maybe if I get really explicit and leave nothing to the optimizer, but then I have to unroll by hand and lose generality. – harold Dec 12 '16 at 23:12
  • Also GCC is doing all the math with dwords, which isn't necessary (it would be if `Cdiv` stops being 1, except for some specific cases and rounding a bit; it definitely has to be a power of two at least, since there is no integer vector division). So `short temp` results in nicer code. Not for 5x5 though; that's still too much to ask, apparently. GCC is also really careful not to overwrite the padding, so every row starts and ends with 15 scalar convolutions. This is easy to avoid with intrinsics (the top and bottom should still have that, or the read will go outside the input). – harold Dec 12 '16 at 23:16
  • With intrinsics it's OK. Actually, this program is a hard implementation of 2D convolution. With four nested loops, gcc can improve it significantly, but not with this implementation. – Amiri Apr 01 '17 at 17:17
  • It seems you're not the first one looking into this. Auto-vectorization is quite complex and works (or not) depending heavily on how you write your code, so rewriting your code structure might help a lot. I believe linking pages is not always the way to go on Stack Overflow, but I would like to link you to the following page: https://locklessinc.com/articles/vectorize/ This guy put quite some work into vectorization, and I hope it helps you solve your problem. – koldewb Apr 03 '17 at 06:47
  • @koldewb, Thanks a lot, the link seems very helpful. Yes, it is possible to change it. Indeed, I changed the naive implementation with four loops into this one with three loops; the scalar code slowed down and the auto-vectorizer did not understand it! – Amiri Apr 03 '17 at 21:18

3 Answers


It seems no one is interested in answering this question, so I will share my findings and update this answer in the future.

First update: In my experience, `gcc -fopt-info-vec` reports vectorization only for Csize <= 16. This is because the vectorization factor is 16, and it is one of the reasons gcc does not vectorize the unusual implementation for other kernel sizes. The vectorization factor is the number of elements that fit in a vector; here the elements are 16-bit short integers, so a 256-bit vector holds 16 of them.

From Wikipedia:

In the first step, the compiler looks for obstacles that can prevent vectorization. A major obstacle for vectorization is true data dependency shorter than the vector length. Other obstacles include function calls and short iteration counts.

Amiri

The main obstacle for the auto-vectorizer is a non-constant loop bound: in your implementation, if you use `int Csize = P*Q;`, the loop won't be vectorized, so keep this in mind to help the auto-vectorizer. It is not the problem here, because you declared Csize with `#define`, but note it in your future work. Your unusual implementation is a loop transformation of the naive implementation, which is an optimization method in compilers; it seems you ruined the naive implementation. Your finding says it is restricted because of the factor of 16, so I unrolled your unusual function, and the auto-vectorizer reports that it has been vectorized:

for(k=0; k< P*Q; k+=2){
    temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);
    temp += (kernel[(k+1)/Q][(k+1)%Q]) * (input[i - (P/2) + ((k+1)/Q)][j - (Q/2) + ((k+1)%Q)]);
}

It also works for a 7x7 kernel:

for(k=0; k+3 < P*Q; k+=4){
    temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);
    temp += (kernel[(k+1)/Q][(k+1)%Q]) * (input[i - (P/2) + ((k+1)/Q)][j - (Q/2) + ((k+1)%Q)]);
    temp += (kernel[(k+2)/Q][(k+2)%Q]) * (input[i - (P/2) + ((k+2)/Q)][j - (Q/2) + ((k+2)%Q)]);
    temp += (kernel[(k+3)/Q][(k+3)%Q]) * (input[i - (P/2) + ((k+3)/Q)][j - (Q/2) + ((k+3)%Q)]);
}
for(; k < P*Q; k++) // 49 is not a multiple of 4: finish the remaining elements
    temp += (kernel[k/Q][k%Q]) * (input[i - (P/2) + (k/Q)][j - (Q/2) + (k%Q)]);

You don't need to unroll it yourself; you might be able to force the compiler to unroll, or change the loop structure, with `#pragma` attributes. This works because of the SLP (superword-level parallelism) approach compilers use for auto-vectorization, and interestingly SLP is based on unrolling!

ADMS

My guess is that it is failing to optimize due to issues with memory layout. You have specified the convolution data to be 2-byte shorts. SSE works with 128-bit vectors, and AVX/AVX2 with 256-bit vectors (AVX-512 with 512-bit ones).

On my machine I declared conv like this:

uint16_t conv[Cdim1][8] = {0}; //You need to pad extra fields with zeroes

And later replaced the inner loop like this:

for(ki = 0; ki < Cdim1; ++ki)
    for(kj = 0; kj < 8; ++kj)
        temp += (conv[ki][kj]) * (input[i - (Cdim1/2) + ki][j - (Cdim2/2) + kj]);

Compiling with: gcc so.c -Wall -Wextra -Ofast -mtune=native gave me vector optimizations!

Bad things:

  • Don't hard-code 8. Try to find the minimal required padding and make a macro, so it also works with convolution matrices of dimension >= 8
  • Pad the input with some zeroes so that the undefined behavior at the edges disappears
  • Note that this is actually not increasing your performance. In fact it works slower!
  • Note that you can squeeze out a couple of cycles if you modify this further so that the loops execute in the order for(ki) for(i) for(j) for(kj). This is probably due to less register pressure, since each row of conv can stay in registers longer. It might also be a quirk of my CPU.
  • You might want to consider using __attribute__((aligned(8))) when declaring variables as well. In this case it didn't change anything, but when optimizing you want to consider this too. Naturally this only works on GCC, and you will need other hacks for MSVC.
vguberinic
  • Thanks, but you changed the inner loop. As I mentioned in the comments (and I'll update the question), gcc can vectorize the four-loop implementation: my gcc 6.2 vectorizes it for 3x3 and 5x5 kernels with ~6x speedup, and gcc even improves 7x7 and 9x9. BTW, the problem is about the unusual implementation, which was called `seqconv`, not the naive implementation. – Amiri Apr 05 '17 at 00:00