How to take advantage of AVX2 on new CPUs while also supporting old CPUs?

Question

I have some image processing algorithm, which I implemented in three versions:

1. Using x64 instruction set (rax, rbx, ... registers)
2. Using SSE instruction set (xmm registers)
3. Using AVX2 instruction set (ymm registers)

The performance is improved with each optimization step. However, I need to run it on old CPUs, which only support SSE (I use x64 platform on Visual Studio, so all my CPUs support SSE).

In Visual Studio, there is a setting called "Enable Enhanced Instruction Set", which I must set to /arch:AVX2 to get the best performance on my newer CPUs. However, with this setting the executable crashes on my older CPUs. If I set "Enable Enhanced Instruction Set" to /arch:SSE2, then my executable works on older CPUs, but I don't get maximum performance on newer CPUs.

I measured execution speed on all combinations of compiler flags and instruction sets, using my newer CPU. The summary is in the following table.

Instruction set ||        Compilation flags
which I use     ||     /arch:SSE     /arch:AVX2
----------------++------------------------------------
x64             ||     bad (4.6)      bad (4.5)
SSE             ||     OK  (1.9)      bad (5.3)
AVX2            ||     bad (3.2)     good (1.4)

My vectorized code uses intrinsics, like so:

// AVX2 - conversion from 32-bit to 16-bit
temp = _mm256_packus_epi32(input[0], input[1]);
output = _mm256_permute4x64_epi64(temp, 0xd8);

// SSE - choosing one of two results using a mask
result = _mm_blendv_epi8(result0, result1, mask);

I guess that if Visual Studio gets the /arch:AVX2 compilation flag, it does all the necessary AVX2-specific optimizations, like emitting vzeroupper. So I don't see how I can get the best performance on both types of CPUs with the same compiled executable file.

Is this possible? If yes, which compilation flags do I need to give to the Visual Studio compiler?

The Intel Performance Primitives use separate binary implementations per CPU family, selected and loaded from DLLs at runtime, so you might have to go that way. I am surprised that the AVX2 CPU runs SSE code slowly though. — Rup, Apr 08 '19 at 10:14
Visual Studio is not going to duplicate the code automatically and then use the appropriate version depending on the actual CPU, so you have to deal with this yourself. In the end you need to provide several code paths, and how exactly you do this is up to you. As you want to leverage `/arch` optimizations, you most likely want to build several versions of the same DLL and load the most suitable one once you have detected the actual CPU capabilities. — Roman R., Apr 08 '19 at 10:22
@RomanR.: Other compilers can do runtime dispatching to clones compiled with different target options, though. ICC has it with options only for unmodified source, and GCC will do it with `ifunc` stuff. — Peter Cordes, Apr 08 '19 at 11:11

score 2 · Accepted Answer · edited Apr 08 '19 at 10:37

The way Intel does this is CPU dispatching (check the `ax` flag in the Intel compiler documentation). The `ax` flag is specific to the Intel compiler and performs implicit CPU dispatching. It's not available in Visual Studio, so you have to do it manually.

At program startup, you check your CPU features and store the result in some global flags.
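A minimal sketch of such a startup check, assuming an x86 target. The names `CpuFeatures`, `detect_cpu_features` and `cpu_features` are made up for illustration. It uses `__cpuidex` and `_xgetbv` on MSVC, and `<cpuid.h>` plus inline assembly on GCC/Clang. Note that AVX2 needs more than the CPUID bit: the OS also has to save the ymm registers, which is what the XCR0 test is for.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <intrin.h>     // __cpuidex
  #include <immintrin.h>  // _xgetbv
  #define SKETCH_X86 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
  #include <cpuid.h>      // __cpuid_count
  #define SKETCH_X86 1
#endif

// Hypothetical feature flags; extend with whatever your code paths need.
struct CpuFeatures {
    bool sse41 = false;  // needed for e.g. _mm_blendv_epi8
    bool avx2  = false;  // needed for e.g. _mm256_permute4x64_epi64
};

#if defined(SKETCH_X86)
// Runs CPUID and stores EAX, EBX, ECX, EDX in out[0..3].
static void run_cpuid(std::uint32_t leaf, std::uint32_t subleaf,
                      std::uint32_t out[4]) {
#if defined(_MSC_VER)
    int regs[4];
    __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
    for (int i = 0; i < 4; ++i) out[i] = static_cast<std::uint32_t>(regs[i]);
#else
    __cpuid_count(leaf, subleaf, out[0], out[1], out[2], out[3]);
#endif
}

// Reads XCR0. Only call this when CPUID reports OSXSAVE.
static std::uint64_t read_xcr0() {
#if defined(_MSC_VER)
    return _xgetbv(0);
#else
    std::uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#endif
}
#endif

CpuFeatures detect_cpu_features() {
    CpuFeatures f;  // everything false on non-x86 builds
#if defined(SKETCH_X86)
    std::uint32_t r[4];
    run_cpuid(0, 0, r);
    const std::uint32_t max_leaf = r[0];
    if (max_leaf < 1) return f;

    run_cpuid(1, 0, r);
    f.sse41 = ((r[2] >> 19) & 1u) != 0;              // ECX bit 19
    const bool osxsave = ((r[2] >> 27) & 1u) != 0;   // ECX bit 27
    const bool avx     = ((r[2] >> 28) & 1u) != 0;   // ECX bit 28
    // XCR0 bits 1 and 2: the OS saves xmm and ymm state on context switch.
    const bool os_saves_ymm = osxsave && ((read_xcr0() & 0x6) == 0x6);

    if (avx && os_saves_ymm && max_leaf >= 7) {
        run_cpuid(7, 0, r);
        f.avx2 = ((r[1] >> 5) & 1u) != 0;            // leaf 7, EBX bit 5
    }
#endif
    return f;
}

// Detected once on first use, then shared by the whole program.
const CpuFeatures& cpu_features() {
    static const CpuFeatures f = detect_cpu_features();
    return f;
}
```

Calling `cpu_features()` from anywhere returns the cached flags, so the CPUID queries run only once.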

Then, when you call one of your functions, you first check the flag state to see which function you actually want to call.
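One way to avoid testing the flag on every call is to resolve a function pointer once. The sketch below is only an illustration: `convert_sse`, `convert_avx2`, `init_dispatch` and `convert` are hypothetical names, and both variants are scalar stand-ins so that the snippet compiles anywhere. In a real project each variant would contain the intrinsics and live in its own source file, with only the AVX2 file compiled with /arch:AVX2.

```cpp
#include <cstddef>
#include <cstdint>

// Signature shared by all variants: saturating 32-bit -> 16-bit conversion.
using ConvertFn = void (*)(const std::int32_t* in, std::uint16_t* out,
                           std::size_t n);

// Scalar reference body, used here as a stand-in for the real SIMD code.
static void convert_reference(const std::int32_t* in, std::uint16_t* out,
                              std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const std::int32_t v = in[i];
        out[i] = static_cast<std::uint16_t>(v < 0 ? 0 : (v > 65535 ? 65535 : v));
    }
}

// Stand-in for the SSE variant (built with the baseline /arch setting).
void convert_sse(const std::int32_t* in, std::uint16_t* out, std::size_t n) {
    convert_reference(in, out, n);
}

// Stand-in for the AVX2 variant (its source file would use /arch:AVX2).
void convert_avx2(const std::int32_t* in, std::uint16_t* out, std::size_t n) {
    convert_reference(in, out, n);
}

// Picks a variant from the feature flag.
ConvertFn select_convert(bool has_avx2) {
    return has_avx2 ? convert_avx2 : convert_sse;
}

// Defaults to the variant every supported CPU can run.
static ConvertFn g_convert = convert_sse;

// Call once at startup with the result of your CPU check.
void init_dispatch(bool has_avx2) { g_convert = select_convert(has_avx2); }

// What the rest of the program calls.
void convert(const std::int32_t* in, std::uint16_t* out, std::size_t n) {
    g_convert(in, out, n);
}
```

The important part is that the dispatcher itself is compiled with the baseline setting, so no AVX2 instruction can execute before the check has run.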

So you end up with different flavors of your functions. To deal with this, you can put them in a different specific namespace (like libsimdpp does) or you manually mangle your function name (like the Intel compiler does).
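The namespace approach could look roughly like this. It is a sketch with made-up names (`impl_sse`, `impl_avx2`, `Kernels`, `make_kernels`), not how libsimdpp actually lays things out, and the bodies are again scalar stand-ins for the intrinsic versions. The same function name exists in every namespace, and a small table of function pointers is filled in once.

```cpp
#include <cstddef>
#include <cstdint>

namespace impl_sse {
// Stand-in for the SSE flavor (real code would use _mm_blendv_epi8).
void blend(const std::uint8_t* a, const std::uint8_t* b,
           const std::uint8_t* mask, std::uint8_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = (mask[i] & 0x80) ? b[i] : a[i];  // high mask bit selects b
}
}  // namespace impl_sse

namespace impl_avx2 {
// Stand-in for the AVX2 flavor (real code would live in a file built with
// /arch:AVX2 and process 32 bytes per iteration).
void blend(const std::uint8_t* a, const std::uint8_t* b,
           const std::uint8_t* mask, std::uint8_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = (mask[i] & 0x80) ? b[i] : a[i];
}
}  // namespace impl_avx2

using BlendFn = void (*)(const std::uint8_t*, const std::uint8_t*,
                         const std::uint8_t*, std::uint8_t*, std::size_t);

// One entry per kernel; all entries come from the same flavor.
struct Kernels {
    const char* name;  // handy for logging which flavor was chosen
    BlendFn blend;
};

// Builds the table for the detected CPU.
Kernels make_kernels(bool has_avx2) {
    if (has_avx2) return Kernels{"avx2", &impl_avx2::blend};
    return Kernels{"sse", &impl_sse::blend};
}
```

Keeping all kernels of one flavor in one table also makes it easy to force a specific flavor in tests, for example to compare the SSE and AVX2 results on the same machine.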

Also, every 64-bit x86 CPU supports SSE2 by construction, so case 1 does not really exist.

The `ax` flag is specific to the Intel compiler and makes implicit CPU dispatching. It's not available on VS, you have to do it manually. — Matthieu Brucher, Apr 08 '19 at 10:33
