I am trying to use auto-vectorization in my compiler (Microsoft Visual Studio 2013). One of the problems I am facing is that it doesn't want to use AVX2. While investigating this, I constructed the following example, which computes the element-wise sums of two arrays of 16 numbers, each 16 bits wide.
#include <cstdint>   // int16_t

int16_t input1[16] = {0};
int16_t input2[16] = {0};
... // fill the arrays with some data
// Calculate the sum using a loop
int16_t output1[16] = {0};
for (int x = 0; x < 16; x++){
output1[x] = input1[x] + input2[x];
}
The compiler does vectorize this code, but only with 128-bit, SSE-width (xmm) instructions:
vmovdqu xmm1, xmmword ptr [rbp+rax]
lea rax, [rax+10h]
vpaddw xmm1, xmm1, xmmword ptr [rbp+rax+10h]
vmovdqu xmmword ptr [rbp+rax+30h], xmm1
dec rcx
jne main+0b0h
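In case the build settings matter: AVX2 code generation in MSVC is enabled with the /arch:AVX2 switch (available in an updated Visual Studio 2013). Roughly, the build looks like this (the file name is a placeholder, and /FA is only there to produce the assembly listings shown here):

cl /O2 /arch:AVX2 /FA example.cpp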
To confirm that the compiler is able to emit AVX2 code at all, I wrote the same calculation with intrinsics:
#include <immintrin.h>   // AVX2 intrinsics

// Calculate the sum using one AVX2 instruction
int16_t output2[16] = {0};
__m256i in1 = _mm256_loadu_si256((__m256i*)input1);
__m256i in2 = _mm256_loadu_si256((__m256i*)input2);
__m256i out2 = _mm256_add_epi16(in1, in2);
_mm256_storeu_si256((__m256i*)output2, out2);
I checked that the two pieces of code are equivalent (that is, output1 is equal to output2 after they are executed).
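A minimal sketch of that check, assuming both snippets have already run on the same input data (a byte-wise comparison is enough here, since both arrays are plain int16_t[16]):

#include <cstring>
#include <cassert>
...
// Both arrays hold 16 x int16_t (32 bytes); compare them byte-for-byte.
assert(memcmp(output1, output2, sizeof(output1)) == 0);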
For this second version, the compiler does emit AVX2 instructions:
vmovdqu ymm1, ymmword ptr [input2]
vpaddw ymm1, ymm1, ymmword ptr [rbp]
vmovdqu ymmword ptr [output2], ymm1
However, I don't want to rewrite my code to use intrinsics: keeping it as a plain loop is much more natural, remains compatible with older (SSE-only) processors, and has other advantages.
So how can I tweak my example so that the compiler is able to vectorize it using AVX2?