
I have a strange problem with some AVX / AVX2 code that I am working on. I have set up a console test application in C++ (Visual Studio 2017 on Windows 7) with the aim of comparing routines written in plain C++ against the equivalent routines written with the AVX / AVX2 instruction set; each routine is timed. A first problem: the measured time of a single routine changes depending on where it is called.

void TraditionalAVG_UncharToDouble(const unsigned char *vec1, const unsigned char *vec2, double* doubleArray, const unsigned int length) {
    int sumTot = 0;
    double* ptrDouble = doubleArray;
    for (unsigned int packIdx = 0; packIdx < length; ++packIdx) {
        *ptrDouble = ((double)(*(vec1 + packIdx) + *(vec2 + packIdx)))/ ((double)2);
        ptrDouble++;
    }

}
void AVG_uncharToDoubleArray(const unsigned char *vec1, const unsigned char *vec2, double* doubleArray, const unsigned int length) {
    //constexpr unsigned int memoryAlignmentBytes = 32;
    constexpr unsigned int bytesPerPack = 256 / 16;
    unsigned int packCount = length / bytesPerPack;

    double* ptrDouble = doubleArray;

    __m128d divider=_mm_set1_pd(2);

    for (unsigned int packIdx = 0; packIdx < packCount; ++packIdx)
    {
        auto x1 = _mm_loadu_si128((const __m128i*)vec1);
        auto x2 = _mm_loadu_si128((const __m128i*)vec2);

        unsigned char index = 0;

        while(index < 8) {
            index++;
            auto x1lo = _mm_cvtepu8_epi64(x1);
            auto x2lo = _mm_cvtepu8_epi64(x2);
            __m128d x1_pd = int64_to_double_full(x1lo);
            __m128d x2_pd = int64_to_double_full(x2lo);

            _mm_store_pd(ptrDouble, _mm_div_pd(_mm_add_pd(x1_pd, x2_pd), divider));
            ptrDouble = ptrDouble + 2;
            x1 = _mm_srli_si128(x1, 2);
            x2 = _mm_srli_si128(x2, 2);
        }
        vec1 += bytesPerPack;
        vec2 += bytesPerPack;
    }

    for (unsigned int ii = 0 ; ii < length % packCount; ++ii)
    {
        *(ptrDouble + ii) = (double)(*(vec1 + ii) + *(vec2 + ii))/ (double)2;
    }

}

... in main ...

timeAvg02 = 0;
Start_TimerMS();
AVG_uncharToDoubleArray(unCharArray, unCharArrayBis, doubleArray, N);
End_TimerMS(&timeAvg02);
std::cout << "AVX2_AVG UncharTodoubleArray:: " << timeAvg02 << " ms" << std::endl;
//printerDouble("AvxDouble", doubleArray, N);
std::cout << std::endl;

timeAvg01 = 0;
Start_TimerMS3();
TraditionalAVG_UncharToDouble(unCharArray, unCharArrayBis, doubleArray, N);
End_TimerMS3(&timeAvg01);
std::cout << "Traditional_AVG UncharTodoubleArray: " << timeAvg01 << " ms" << std::endl;
//printerDouble("TraditionalAvgDouble", doubleArray, N);
std::cout << std::endl;

The second problem is that the routines written with AVX2 intrinsics are slower than the routines written in plain C++. The images show the results of the two tests.

Result of test 1

Result of test 2

How can I overcome this strange behavior? What is the reason behind it?

Marek R
  • Try to disassemble the compiler-generated code, but I would hazard a guess that it doesn't use the assembly divide instruction, but just subtracts 1 from the exponent. Take a look at https://stackoverflow.com/questions/7720668/fast-multiplication-division-by-2-for-floats-and-doubles-c-c – YuvGM Oct 10 '22 at 15:23
  • Likely that the optimizer in VS2017 simply was able to generate code that was more efficient than your manually crafted AVX stuff. It is a common problem to over-optimize stuff, and end up with something that is in fact slower. Can't you use the regular code and tell VS to use AVX2 instead? (compiler flag /arch:AVX2) – Sven Nilsson Oct 10 '22 at 16:13
  • What's `int64_to_double_full`, and why are you using it on values zero-extended from `char`? You only need to widen to 32-bit elements so you can use hardware `_mm256_cvtepi32_pd`. Also, MSVC doesn't optimize intrinsics much if at all; it might not be optimizing `__m128d divider=_mm_set1_pd(2);` into a multiply or FMA by `0.5`. – Peter Cordes Oct 10 '22 at 16:41
  • Also, why is your scalar C obfuscated as `*ptrDouble = ((double)(*(vec1 + packIdx) + *(vec2 + packIdx)))/ ((double)2);` instead of `(vec1[packIdx] + vec2[packIdx]) / 2.0`? Either way I'd expect that to auto-vectorize nicely to **integer addition** before widening to FP. – Peter Cordes Oct 10 '22 at 16:44

1 Answer


MSVC doesn't optimize intrinsics (much), so you get an actual vdivpd by 2.0, not a multiply by 0.5. That's a worse bottleneck than scalar, less than one element per clock cycle on most CPUs. (e.g. Skylake / Ice Lake / Alder Lake-P: 4 cycle throughput for vdivpd xmm, or 8 cycles for vdivpd ymm, either way 2 cycles per element. https://uops.info)
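For example, a minimal sketch of that strength-reduction done by hand (the helper name is mine, not from the question); dividing by 2.0 and multiplying by 0.5 give identical results for every double, and the multiply avoids the divider-unit bottleneck:

#include <immintrin.h>

__m128d avg_pd(__m128d a, __m128d b) {
    const __m128d half = _mm_set1_pd(0.5);        // instead of _mm_set1_pd(2) as a divisor
    return _mm_mul_pd(_mm_add_pd(a, b), half);    // vaddpd + vmulpd, no vdivpd
}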

From Godbolt, here's the inner loop with MSVC 19.33 -O2 -arch:AVX2, for a version that compiles (replacing your undefined int64_to_double_full with an efficient 32-bit conversion). Your version is probably even worse.

$LL5@AVG_unchar:
        vpmovzxbd xmm0, xmm5
        vpmovzxbd xmm1, xmm4
        vcvtdq2pd xmm3, xmm0
        vcvtdq2pd xmm2, xmm1
        vaddpd  xmm0, xmm3, xmm2
        vdivpd  xmm3, xmm0, xmm6         ;; performance disaster
        vmovupd XMMWORD PTR [r8], xmm3
        add     r8, 16
        vpsrldq xmm4, xmm4, 2
        vpsrldq xmm5, xmm5, 2
        sub     rax, 1
        jne     SHORT $LL5@AVG_unchar

Also, AVX2 implies support for 256-bit integer as well as FP vectors, so you can use __m256i. Although with this shift strategy for using the chars of a vector, you wouldn't want to. You'd just want to use __m256d.

Look at how clang vectorizes the scalar C++: https://godbolt.org/z/Yzze98qnY 2x vpmovzxbd-load of __m128i / vpaddd __m128i / vcvtdq2pd to __m256d / vmulpd __m256d (by 0.5) / vmovupd. (Narrow loads as a memory source for vpmovzxbd are good, especially with an XMM destination so they can micro-fuse on Intel CPUs. Writing this with intrinsics relies on compilers optimizing _mm_loadu_si32 into a memory source for _mm_cvtepu8_epi32. Looping to use all bytes of a wider load isn't crazy, but costs more shuffles. clang unrolls that loop, replacing later vpsrldq / vpmovzxbd with vpshufb shuffles to move bytes directly to where they're needed, at the cost of needing more constants.)
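For reference, here's a sketch of what that strategy looks like written with intrinsics (the function name and the assumption that length is a multiple of 4 are mine, not from the question):

#include <immintrin.h>

void avg_u8_to_double_avx2(const unsigned char* vec1, const unsigned char* vec2,
                           double* out, unsigned int length)
{
    const __m256d half = _mm256_set1_pd(0.5);
    for (unsigned int i = 0; i < length; i += 4) {
        // 4-byte loads zero-extended to 32-bit elements; a good compiler folds
        // the load into a memory source operand for vpmovzxbd.
        __m128i a = _mm_cvtepu8_epi32(_mm_loadu_si32(vec1 + i));
        __m128i b = _mm_cvtepu8_epi32(_mm_loadu_si32(vec2 + i));
        __m128i sum = _mm_add_epi32(a, b);                  // integer add before conversion
        __m256d d = _mm256_cvtepi32_pd(sum);                // vcvtdq2pd: 4x int32 -> 4x double
        _mm256_storeu_pd(out + i, _mm256_mul_pd(d, half));  // *0.5 instead of /2.0
    }
}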

IDK what's wrong with MSVC, why it failed to auto-vectorize with -O2 -arch:AVX2, but at least it optimized /2.0 to *0.5. When the reciprocal is exactly representable as a double, that's a well-known safe and valuable optimization.

With a good compiler, there'd be no need for intrinsics. But "good" seems to only include clang; GCC makes a bit of a mess with converting vector widths.


Your scalar C is strangely obfuscated as *ptrDouble = ((double)(*(vec1 + packIdx) + *(vec2 + packIdx)))/ ((double)2); instead of
(vec1[packIdx] + vec2[packIdx]) / 2.0.

Doing integer addition like this scalar code before conversion to FP is a good idea, especially for a vectorized version, so there's only one conversion. Each input already needs to get widened separately to 32-bit elements.
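Written plainly, the scalar version is just the following (a sketch with my function name; the integer add happens via integer promotion, before the single int->double conversion):

void avg_u8_to_double_scalar(const unsigned char* vec1, const unsigned char* vec2,
                             double* out, unsigned int length)
{
    for (unsigned int i = 0; i < length; ++i)
        out[i] = (vec1[i] + vec2[i]) / 2.0;   // clang auto-vectorizes this nicely
}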


IDK what int64_to_double_full is, but if it's manual emulation of AVX-512 vcvtqq2pd, it makes no sense to use it on values zero-extended from char. That value range fits comfortably in int32_t, so you can widen only to 32-bit elements, and let hardware packed int->FP conversion with _mm256_cvtepi32_pd (vcvtdq2pd) widen the elements.
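Applied to the question's inner loop, that means something like this (a sketch, untested; x1, x2, and ptrDouble are the variables from AVG_uncharToDoubleArray, and the step count and shift amount change because each step now consumes 4 bytes instead of 2):

const __m256d half = _mm256_set1_pd(0.5);
for (int step = 0; step < 4; ++step) {
    __m256d x1_pd = _mm256_cvtepi32_pd(_mm_cvtepu8_epi32(x1));  // low 4 bytes -> 4 doubles
    __m256d x2_pd = _mm256_cvtepi32_pd(_mm_cvtepu8_epi32(x2));
    _mm256_storeu_pd(ptrDouble, _mm256_mul_pd(_mm256_add_pd(x1_pd, x2_pd), half));
    ptrDouble += 4;
    x1 = _mm_srli_si128(x1, 4);   // discard the 4 bytes just consumed
    x2 = _mm_srli_si128(x2, 4);
}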

Peter Cordes