I have an image processing algorithm, which I implemented in three versions:
- Using x64 instruction set (rax, rbx, ... registers)
- Using SSE instruction set (xmm registers)
- Using AVX2 instruction set (ymm registers)
Performance improves with each optimization step. However, I need to run it on old CPUs, which only support SSE (I build for the x64 platform in Visual Studio, so all my CPUs support SSE).
In Visual Studio, there is a setting called "Enable Enhanced Instruction Set", which I must set to /arch:AVX2 to get the best performance on my newer CPUs. However, with this setting the executable crashes on my older CPUs. If I set "Enable Enhanced Instruction Set" to /arch:SSE2, then my executable works on older CPUs, but I don't get maximum performance on newer CPUs.
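Since the crash only happens on CPUs without AVX2, I assume a single executable would have to decide at run time which version to call. Here is a minimal sketch of such a check and dispatch. The names `cpu_has_avx2`, `process_sse`, `process_avx2` and `select_kernel` are hypothetical, and the two kernels are scalar stand-ins rather than my real code; I have not verified that this alone gives good performance.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>     // __cpuid, __cpuidex
#include <immintrin.h>  // _xgetbv
#endif

// Returns true when both the CPU and the OS support AVX2.
bool cpu_has_avx2() {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuid(regs, 0);
    if (regs[0] < 7) return false;               // CPUID leaf 7 not available
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0;
    const bool avx     = (regs[2] & (1 << 28)) != 0;
    if (!osxsave || !avx) return false;
    if ((_xgetbv(0) & 0x6) != 0x6) return false; // OS must save XMM and YMM state
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0;            // CPUID.(EAX=7,ECX=0):EBX bit 5 = AVX2
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;                                // unknown platform: assume no AVX2
#endif
}

// Hypothetical kernel signature; the real kernels take image data.
using KernelFn = void (*)(const std::uint8_t* src, std::uint8_t* dst, std::size_t n);

// Stand-in for the SSE version (plain copy, for illustration only).
void process_sse(const std::uint8_t* src, std::uint8_t* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
}

// Stand-in for the AVX2 version (plain copy, for illustration only).
void process_avx2(const std::uint8_t* src, std::uint8_t* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
}

// Picks the kernel once, based on what the running CPU supports.
KernelFn select_kernel() {
    return cpu_has_avx2() ? process_avx2 : process_sse;
}
```

The caller would store the result of `select_kernel()` once at startup and call through that pointer. What I don't know is how the two kernels should be compiled so that each gets the right flags.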
I measured execution speed on all combinations of compiler flags and instruction sets, using my newer CPU. The summary is in the following table.
Instruction set ||        Compilation flags
which I use     ||   /arch:SSE      /arch:AVX2
----------------++------------------------------------
x64             ||   bad (4.6)      bad (4.5)
SSE             ||   OK (1.9)       bad (5.3)
AVX2            ||   bad (3.2)      good (1.4)
My vectorized code uses intrinsics, like so:
// AVX2 - conversion from 32-bit to 16-bit
temp = _mm256_packus_epi32(input[0], input[1]);
output = _mm256_permute4x64_epi64(temp, 0xd8);
// SSE - choosing one of two results using a mask
result = _mm_blendv_epi8(result0, result1, mask);
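To make the intent of these snippets explicit, here is a scalar sketch of what I understand the intrinsics to compute. The function names ending in `_model` are hypothetical helpers written only for illustration; they are not part of my real code and are far slower than the intrinsics.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of _mm256_packus_epi32(a, b): each signed 32-bit value is
// saturated to unsigned 16 bits, and packing happens per 128-bit lane,
// so the result order is a0..a3, b0..b3, a4..a7, b4..b7.
std::array<std::uint16_t, 16> packus_epi32_model(const std::array<std::int32_t, 8>& a,
                                                 const std::array<std::int32_t, 8>& b) {
    auto sat = [](std::int32_t v) {
        return static_cast<std::uint16_t>(v < 0 ? 0 : (v > 65535 ? 65535 : v));
    };
    std::array<std::uint16_t, 16> r{};
    for (std::size_t lane = 0; lane < 2; ++lane) {
        for (std::size_t i = 0; i < 4; ++i) {
            r[lane * 8 + i]     = sat(a[lane * 4 + i]);
            r[lane * 8 + 4 + i] = sat(b[lane * 4 + i]);
        }
    }
    return r;
}

// Scalar model of _mm256_permute4x64_epi64(v, imm): output 64-bit element q
// is taken from input element (imm >> (2 * q)) & 3. One 64-bit element holds
// four 16-bit values.
std::array<std::uint16_t, 16> permute4x64_model(const std::array<std::uint16_t, 16>& v,
                                                unsigned imm) {
    std::array<std::uint16_t, 16> r{};
    for (std::size_t q = 0; q < 4; ++q) {
        const std::size_t src = (imm >> (2 * q)) & 3u;
        for (std::size_t i = 0; i < 4; ++i) r[q * 4 + i] = v[src * 4 + i];
    }
    return r;
}

// The AVX2 snippet: pack, then reorder with 0xd8 (element order 0, 2, 1, 3)
// so the result is a0..a7 followed by b0..b7.
std::array<std::uint16_t, 16> convert_32_to_16_model(const std::array<std::int32_t, 8>& a,
                                                     const std::array<std::int32_t, 8>& b) {
    return permute4x64_model(packus_epi32_model(a, b), 0xd8);
}

// Scalar model of _mm_blendv_epi8(a, b, mask): for each byte, take b when
// the top bit of the mask byte is set, otherwise take a.
std::array<std::uint8_t, 16> blendv_epi8_model(const std::array<std::uint8_t, 16>& a,
                                               const std::array<std::uint8_t, 16>& b,
                                               const std::array<std::uint8_t, 16>& mask) {
    std::array<std::uint8_t, 16> r{};
    for (std::size_t i = 0; i < 16; ++i) r[i] = (mask[i] & 0x80) ? b[i] : a[i];
    return r;
}
```

The permute is needed only because the 256-bit pack works on each 128-bit lane separately; the SSE version of the same conversion has a single lane and needs no reordering.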
I guess that if Visual Studio gets the /arch:AVX2 compilation flag, it does all the necessary AVX2-specific optimizations, like emitting vzeroupper. So I don't see how I can get the best performance on both types of CPUs with the same compiled executable file.
Is this possible? If yes, which compilation flags do I need to give to the Visual Studio compiler?