More aggresive optimization for FMA operations

Question

I want to build a datatype that represents multiple (say N) arithmetic types and provides the same interface as an arithmetic type using operator overloading, such that I get a datatype like Agner Fog's vectorclass.

Please look at this example: Godbolt

#include <array>

using std::size_t;

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD operator*(const T a, const LoopSIMD& x){
        LoopSIMD result;
        for(size_t i=0;i<S;++i)
            result[i] = a*x[i];
        return result;
    }

    LoopSIMD& operator +=(const LoopSIMD& x){
        for(size_t i=0;i<S;++i){
            (*this)[i] += x[i];
        }
        return *this;
    }
};

constexpr size_t N = 7;
typedef LoopSIMD<double,N> SIMD;

SIMD foo(double a, SIMD x, SIMD y){
    x += a*y;
    return x;
}

That seems to work pretty good up to a certain number of elements, which is 6 for gcc-10 and 27 for clang-11. For a larger number of elements the compilers do not use the FMA (e.g. vfmadd213pd) operations anymore. Instead they proceed the multiplications (e.g. vmulpd) and additions (e.g. vaddpd) separately.

Questions:

Is there a good reason for this behavior?
Is there any compiler flag such that I can increase the above mentioned values of 6 for gcc and 27 for clang?

Thank you!

Alex Shirley · Answer 1 · 2020-11-04T17:56:35.003

I did the following, and was able to get some pretty good results, for gcc 10.2 with the same -Ofast -march=skylake -ffast-math as your godbolt link.

friend LoopSIMD operator*(const T a, const LoopSIMD& x) {
    LoopSIMD result;
    std::transform(x.cbegin(), x.cend(), result.begin(),
                   [a](auto const& i) { return a * i; });
    return result;
}

LoopSIMD& operator+=(const LoopSIMD& x) {
    std::transform(this->cbegin(), this->cend(), x.cbegin(), this->begin(),
                   [](auto const& a, auto const& b) { return a + b; });
    return *this;
}

std::transform has some crazy overloads so I think I need to explain.

The first overload captures a, multiplies each value, and stores it back at the beginning of result.

The second overload acts as a zip adding both values together from x and this and storing the result back to this.

If you're not married to operator+= and operator* you can create your own fma like so

    LoopSIMD& fma(const LoopSIMD& x, double a ){
        std::transform_inclusive_scan(
            x.cbegin(),
            x.cend(),
            this->begin(),
            std::plus{},
            [a](auto const& i){return i * a;},
            0.0);
        return *this;
    }

This requires c++17, but will loop keep the SIMD instruction in

foo(double, LoopSIMD<double, 40ul>&, LoopSIMD<double, 40ul> const&):
        xor     eax, eax
        vxorpd  xmm1, xmm1, xmm1
.L2:
        vfmadd231sd     xmm1, xmm0, QWORD PTR [rsi+rax]
        vmovsd  QWORD PTR [rdi+rax], xmm1
        add     rax, 8
        cmp     rax, 320
        jne     .L2
        ret

thanks! But there is a error in your code. The `std::transform` call in the operator call must be `std::transform(x.begin(), x.end(), result.begin(), ...);`. Then i got the same results as in my example. — Nils, Nov 04 '20 at 16:35
Ahhh, good catch. I also forget to add cbegin and cend iterators, and while it does seem to improve things, the core functionality doesn't seem to change much. There might be something that you can do with lazy evaluation, functional programming, and intrinsics, that might improve things, but I think that requires a lot more work — Alex Shirley, Nov 04 '20 at 16:44
Don't forget to make `y` a const ref, you'll see a pretty big reduction on foo as well. Maybe might creating a copy of `x` and making it const ref as well will help — Alex Shirley, Nov 04 '20 at 16:54

score 0 · Answer 2 · answered Nov 04 '20 at 17:39

You could also simply make your own fma function:

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD fma(const LoopSIMD& x, const T y, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(x[i], y, z[i]);
        }
        return result;
    }
    friend LoopSIMD fma(const T y, const LoopSIMD& x, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(y, x[i], z[i]);
        }
        return result;
    }
    // And more variants, taking `const LoopSIMD&, const LoopSIMD&, const T`, `const LoopSIMD&, const T, const T`, etc
};

SIMD foo(double a, SIMD x, SIMD y){
    return fma(a, y, x);
}

But to allow for better optimisations in the first place, you should align your array. Your original code optimises well if you do:

constexpr size_t next_power_of_2_not_less_than(size_t n) {
    size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    // operators
} __attribute__((aligned(next_power_of_2_not_less_than(sizeof(T[S])))));

// Or with a c++11 attribute
/*
template<class T, size_t S>
class [[gnu::aligned(next_power_of_2_not_less_than(sizeof(T[S])))]] LoopSIMD : std::array<T,S>
{
public:
    // operators
};
*/

SIMD foo(double a, SIMD x, SIMD y){
    x += a * y;
    return x;
}

Thanks! I know how to solve that with c++ techniques. I think an alternative would be an expression template that is returned by `operator*` and then applies the fma operations in the `operator +=`. But I was aiming more for the compiler optimizations in the question, as we have already running code, that I don't want to change. — Nils, Nov 05 '20 at 07:22

Nils · Answer 3 · 2021-09-23T12:59:02.433

0

I've found an improvement for the example given.

Adding #pragma omp simd before the loops GCC manages to make the FMA optimization up to N=71.

https://godbolt.org/z/Y3T1rs37W

The size could even more improved if AVX512 is used:

https://godbolt.org/z/jWWPP7W5G

edited Sep 23 '21 at 12:59

answered Sep 20 '21 at 10:29

Nils

31
3

1

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-ask). – Community Sep 20 '21 at 12:34

More aggresive optimization for FMA operations

3 Answers3