I am trying to use auto-vectorization in my compiler (Microsoft Visual Studio 2013). One of the problems I am facing is that it doesn't want to use AVX2. While investigating this, I constructed the following example, which computes the element-wise sums of two arrays of 16 numbers, each 16 bits wide.
#include <cstdint>   // int16_t

int16_t input1[16] = {0};
int16_t input2[16] = {0};
... // fill the arrays with some data
// Calculate the sum using a loop
int16_t output1[16] = {0};
for (int x = 0; x < 16; x++){
output1[x] = input1[x] + input2[x];
}
The compiler does vectorize this code, but only with 128-bit, SSE-width (xmm) instructions:
vmovdqu xmm1, xmmword ptr [rbp+rax]
lea rax, [rax+10h]
vpaddw xmm1, xmm1, xmmword ptr [rbp+rax+10h]
vmovdqu xmmword ptr [rbp+rax+30h], xmm1
dec rcx
jne main+0b0h
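In case the build settings matter: AVX2 code generation in MSVC is enabled with the /arch:AVX2 switch (available in an updated Visual Studio 2013). Roughly, the build looks like this (the file name is a placeholder, and /FA is only there to produce the assembly listings shown here):

cl /O2 /arch:AVX2 /FA example.cpp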
To confirm that the compiler is able to emit AVX2 code at all, I wrote the same calculation with intrinsics:
#include <immintrin.h>   // AVX2 intrinsics

// Calculate the sum using one AVX2 instruction
int16_t output2[16] = {0};
__m256i in1 = _mm256_loadu_si256((__m256i*)input1);
__m256i in2 = _mm256_loadu_si256((__m256i*)input2);
__m256i out2 = _mm256_add_epi16(in1, in2);
_mm256_storeu_si256((__m256i*)output2, out2);
I checked that the two pieces of code are equivalent (that is, output1 is equal to output2 after they are executed).
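A minimal sketch of that check, assuming both snippets have already run on the same input data (a byte-wise comparison is enough here, since both arrays are plain int16_t[16]):

#include <cstring>
#include <cassert>
...
// Both arrays hold 16 x int16_t (32 bytes); compare them byte-for-byte.
assert(memcmp(output1, output2, sizeof(output1)) == 0);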
For this second version, the compiler does emit AVX2 instructions:
vmovdqu ymm1, ymmword ptr [input2]
vpaddw ymm1, ymm1, ymmword ptr [rbp]
vmovdqu ymmword ptr [output2], ymm1
However, I don't want to rewrite my code to use intrinsics: keeping it as a plain loop is much more natural, remains compatible with older (SSE-only) processors, and has other advantages.
So how can I tweak my example so that the compiler is able to vectorize it using AVX2?