
I am writing a small template library to transpose arbitrary matrices using AVX intrinsics. Since I am heavily using `if constexpr` and templates, I wanted to make sure that the compiler applies all the optimizations I expect, so I benchmarked my code. I came across a result I don't really understand.

The functions have a template parameter that controls how unused register values should be handled. One option is to keep whatever ends up in them during the performed operations. The other is to write only to the entries necessary to store the result. I have removed all the template stuff and written a short example for a 7x4 matrix:

EDIT: This code is wrong --- see UPDATE

void Transpose7x4(__m256 in0, __m256 in1, __m256 in2, __m256 in3, __m256& out0, __m256& out1, __m256& out2,
                    __m256& out3, __m256& out4, __m256& out5, __m256& out6)
{
    __m256 tout0, tout1, tout2, tout3, tout4, tout5, tout6;
    __m256 tmp0, tmp1, tmp2, tmp3;


    __m256 tmp4 = _mm256_unpacklo_ps(in3, in0);
    __m256 tmp5 = _mm256_unpackhi_ps(in3, in0);
    __m256 tmp6 = _mm256_unpacklo_ps(in1, in2);
    __m256 tmp7 = _mm256_unpackhi_ps(in1, in2);

    tmp0 = _mm256_shuffle_ps(tmp4, tmp6, 0x44);
    tmp1 = _mm256_shuffle_ps(tmp6, tmp4, 0xee);
    tmp2 = _mm256_shuffle_ps(tmp5, tmp7, 0x44);
    tmp3 = _mm256_shuffle_ps(tmp7, tmp5, 0xee);

    tout0 = _mm256_permute2f128_ps(tmp0, tmp0, 0x00);
    tout1 = _mm256_permute2f128_ps(tmp1, tmp1, 0x00);
    tout2 = _mm256_permute2f128_ps(tmp2, tmp2, 0x00);
    tout3 = _mm256_permute2f128_ps(tmp3, tmp3, 0x00);
    tout4 = _mm256_permute2f128_ps(tmp0, tmp0, 0x44);
    tout5 = _mm256_permute2f128_ps(tmp1, tmp1, 0x44);
    tout6 = _mm256_permute2f128_ps(tmp2, tmp2, 0x44);


    // Don't care what is written to unused values
    out0 = tout0;
    out1 = tout1;
    out2 = tout2;
    out3 = tout3;
    out4 = tout4;
    out5 = tout5;
    out6 = tout6;

    // Only write to values necessary to store the result
    //out0 = _mm256_blend_ps(out0, tout0, 0xfe);
    //out1 = _mm256_blend_ps(out1, tout1, 0xfe);
    //out2 = _mm256_blend_ps(out2, tout2, 0xfe);
    //out3 = _mm256_blend_ps(out3, tout3, 0xfe);
    //out4 = _mm256_blend_ps(out4, tout4, 0xfe);
    //out5 = _mm256_blend_ps(out5, tout5, 0xfe);
    //out6 = _mm256_blend_ps(out6, tout6, 0xfe);
}

As you can see, the version that does not overwrite the unused values needs additional blends, so I expected it to be slightly slower. However, the benchmark results (Clang 8.0.0 and GCC 8.3.0 on an Intel Skylake processor) told me otherwise. 100 transpositions took around 430 ns for the version with blending, while the other version took around 670 ns. I checked the assembly to see if something weird is happening, but I can't spot anything: godbolt

The assembly is more or less identical; the only difference is that one version has vmovaps interleaved with the additional vblendps (and one vperm2f128).

I calculated the expected clock cycles, taking the pipelining of `_mm256_permute2f128_ps` into account. For the code without the blending, I came up with 17 cycles. Multiplying by 100 and dividing by my processor frequency gave 425 ns, which is pretty much what I measured for the version with blending. The only explanation I can see for the version without blending taking longer is that the pipelining of `_mm256_permute2f128_ps` can't be utilized for some reason. If I instead assume that every `_mm256_permute2f128_ps` takes 3 clock cycles, I get 725 ns, which is much closer to the result I measured.

So the question is: why is the version with the blends faster (utilizing instruction pipelining) than the "simpler" version, and how can I fix that?

  • *that every _mm256_permute2f128_ps takes 3 clock cycles* No, that's the *latency* of lane-crossing shuffles. Shuffle throughput is 1/clock on HSW/SKL. How are you measuring performance? Are you sure you're letting the CPU get up to speed (max turbo, not idle)? 4.3ns per execution of the function in your Godbolt link looks about normal: I count 12 shuffle instructions, so 12 cycles / 4.3ns = ~2.8 GHz assuming this saturates the shuffle port and actually runs 1 shuffle per clock without other bottlenecks. You haven't specified anything about your HW or whether you're testing tput or latency – Peter Cordes Mar 08 '20 at 03:43
  • @PeterCordes I found the problem in the benchmark itself and added the solution to my post. Your doubts about the timings pushed me to check my benchmarks again. So thanks. I also wrote how I came up with the 17 cycles. I still need to figure out, what the compiler is doing to get rid of 3 inter lane permutations. – wychmaster Mar 08 '20 at 12:49
  • @PeterCordes Found the reason why 3 `_mm256_permute2f128_ps` were optimized away. See the update in the answer. – wychmaster Mar 08 '20 at 14:02
  • If that's a full answer, post it as an answer instead of giving it a special place as part of the now non-question. – Peter Cordes Mar 08 '20 at 16:14
  • I will move the corresponding sections to an answer as soon as I can. – wychmaster Mar 08 '20 at 19:56

1 Answer

Found the solution. Peter Cordes' comment pushed me in the right direction: something in my benchmark was wrong. I am using Google Benchmark, and here is the original benchmark code I used:

#include <benchmark/benchmark.h>

#include <x86intrin.h>

#include <array>



class FixtureBenchmark_m256 : public benchmark::Fixture
{
public:
    std::array<std::array<__m256, 8>, 10000> in;
    std::array<std::array<__m256, 8>, 10000> out;

    FixtureBenchmark_m256()
    {
        __m256 tmp0 = _mm256_setr_ps(1, 2, 3, 4, 5, 6, 7, 8);
        for (std::size_t i = 0; i < 1000; ++i)
            for (std::size_t j = 0; j < 8; ++j)
            {
                __m256 tmp1 = _mm256_set1_ps(i * 8 + j);
                in[i][j] = _mm256_mul_ps(tmp0, tmp1);
            }
    }
};



void T7x4_assign(__m256 in0, __m256 in1, __m256 in2, __m256 in3, __m256& out0, __m256& out1, __m256& out2, __m256& out3,
                 __m256& out4, __m256& out5, __m256& out6)
{
    __m256 tout0, tout1, tout2, tout3, tout4, tout5, tout6;
    __m256 tmp0, tmp1, tmp2, tmp3;


    __m256 tmp4 = _mm256_unpacklo_ps(in3, in0);
    __m256 tmp5 = _mm256_unpackhi_ps(in3, in0);
    __m256 tmp6 = _mm256_unpacklo_ps(in1, in2);
    __m256 tmp7 = _mm256_unpackhi_ps(in1, in2);

    tmp0 = _mm256_shuffle_ps(tmp4, tmp6, 0x44);
    tmp1 = _mm256_shuffle_ps(tmp6, tmp4, 0xee);
    tmp2 = _mm256_shuffle_ps(tmp5, tmp7, 0x44);
    tmp3 = _mm256_shuffle_ps(tmp7, tmp5, 0xee);

    tout0 = _mm256_permute2f128_ps(tmp0, tmp0, 0x00);
    tout1 = _mm256_permute2f128_ps(tmp1, tmp1, 0x00);
    tout2 = _mm256_permute2f128_ps(tmp2, tmp2, 0x00);
    tout3 = _mm256_permute2f128_ps(tmp3, tmp3, 0x00);
    tout4 = _mm256_permute2f128_ps(tmp0, tmp0, 0x44);
    tout5 = _mm256_permute2f128_ps(tmp1, tmp1, 0x44);
    tout6 = _mm256_permute2f128_ps(tmp2, tmp2, 0x44);

    out0 = tout0;
    out1 = tout1;
    out2 = tout2;
    out3 = tout3;
    out4 = tout4;
    out5 = tout5;
    out6 = tout6;
}


void T7x4_blend(__m256 in0, __m256 in1, __m256 in2, __m256 in3, __m256& out0, __m256& out1, __m256& out2, __m256& out3,
                __m256& out4, __m256& out5, __m256& out6)
{
    __m256 tout0, tout1, tout2, tout3, tout4, tout5, tout6;
    __m256 tmp0, tmp1, tmp2, tmp3;

    __m256 tmp4 = _mm256_unpacklo_ps(in3, in0);
    __m256 tmp5 = _mm256_unpackhi_ps(in3, in0);
    __m256 tmp6 = _mm256_unpacklo_ps(in1, in2);
    __m256 tmp7 = _mm256_unpackhi_ps(in1, in2);

    tmp0 = _mm256_shuffle_ps(tmp4, tmp6, 0x44);
    tmp1 = _mm256_shuffle_ps(tmp6, tmp4, 0xee);
    tmp2 = _mm256_shuffle_ps(tmp5, tmp7, 0x44);
    tmp3 = _mm256_shuffle_ps(tmp7, tmp5, 0xee);

    tout0 = _mm256_permute2f128_ps(tmp0, tmp0, 0x00);
    tout1 = _mm256_permute2f128_ps(tmp1, tmp1, 0x00);
    tout2 = _mm256_permute2f128_ps(tmp2, tmp2, 0x00);
    tout3 = _mm256_permute2f128_ps(tmp3, tmp3, 0x00);
    tout4 = _mm256_permute2f128_ps(tmp0, tmp0, 0x44);
    tout5 = _mm256_permute2f128_ps(tmp1, tmp1, 0x44);
    tout6 = _mm256_permute2f128_ps(tmp2, tmp2, 0x44);

    out0 = _mm256_blend_ps(out0, tout0, 0xfe);
    out1 = _mm256_blend_ps(out1, tout1, 0xfe);
    out2 = _mm256_blend_ps(out2, tout2, 0xfe);
    out3 = _mm256_blend_ps(out3, tout3, 0xfe);
    out4 = _mm256_blend_ps(out4, tout4, 0xfe);
    out5 = _mm256_blend_ps(out5, tout5, 0xfe);
    out6 = _mm256_blend_ps(out6, tout6, 0xfe);
}



BENCHMARK_F(FixtureBenchmark_m256, 7x4_assign)(benchmark::State& state)
{
    for (auto _ : state)
    {
        for (std::size_t i = 0; i < 100; ++i)
        {
            T7x4_assign(in[i][0], in[i][1], in[i][2], in[i][3], out[i][0], out[i][1], out[i][2], out[i][3], out[i][4],
                        out[i][5], out[i][6]);
            benchmark::ClobberMemory();
        }
    }
}

BENCHMARK_F(FixtureBenchmark_m256, 7x4_blend)(benchmark::State& state)
{
    for (auto _ : state)
    {
        for (std::size_t i = 0; i < 100; ++i)
        {
            T7x4_blend(in[i][0], in[i][1], in[i][2], in[i][3], out[i][0], out[i][1], out[i][2], out[i][3], out[i][4],
                       out[i][5], out[i][6]);
            benchmark::ClobberMemory();
        }
    }
}

BENCHMARK_MAIN();

This gave the output:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
FixtureBenchmark_m256/7x4_assign        646 ns          646 ns      1081509
FixtureBenchmark_m256/7x4_blend         380 ns          380 ns      1847485

The problem here is the loop. I can't really say what exactly is happening, maybe cache misses or some weird loop optimization, but removing the loop (a sketch of the loop-free benchmark follows the table below) gives the expected timings:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
FixtureBenchmark_m256/7x4_assign       3.27 ns         3.27 ns    214698649
FixtureBenchmark_m256/7x4_blend        4.15 ns         4.14 ns    168642478
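
For reference, the loop-free benchmark looks roughly like this (a sketch; only the assign variant is shown, the blend variant is analogous, and the fixed index 0 is arbitrary):

BENCHMARK_F(FixtureBenchmark_m256, 7x4_assign)(benchmark::State& state)
{
    for (auto _ : state)
    {
        // single call per benchmark iteration instead of the inner loop over 100 matrices
        T7x4_assign(in[0][0], in[0][1], in[0][2], in[0][3], out[0][0], out[0][1], out[0][2],
                    out[0][3], out[0][4], out[0][5], out[0][6]);
        benchmark::ClobberMemory();
    }
}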

So why the loop in the first place? Because I had installed Google Benchmark on Ubuntu with `sudo apt-get install libbenchmark-dev`. The problem is that this installs a debug build, in which nanosecond timings are rounded. So I couldn't see any difference for a single execution and therefore timed multiple function calls with a loop. However, after manually building and installing the release version, I got more accurate timings and could remove the loop, which had negatively affected the benchmark.

An additional remark: I also miscalculated the expected CPU cycles. I counted the intrinsics instead of the optimized assembly, so I came up with 8 in-lane shuffles and 7 inter-lane shuffles, which gives 15. Adding the unavoidable latency of the last inter-lane permutation (2 extra cycles) gave 17. However, the compiler optimizes 3 of the `_mm256_permute2f128_ps` away, which gives 14 (12 shuffles, as Peter Cordes said, plus 2 cycles of latency). Dividing by my CPU frequency of 4.2 GHz gives 3.33 ns, which is quite close to the benchmark result.
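
Spelled out, the corrected estimate looks like this (my own tally, assuming one shuffle per clock on the single shuffle port, as Peter Cordes noted in the comments):

8 in-lane shuffles (vunpcklps/vunpckhps/vshufps)        ->  8 cycles
7 vperm2f128, of which 3 are merged by the compiler     ->  4 cycles
remaining latency of the last vperm2f128                -> +2 cycles
total                                                   -> 14 cycles
14 cycles / 4.2 GHz ≈ 3.33 ns per call (measured: 3.27 ns)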

UPDATE

I was wondering why the compiler optimized away 3 of the `_mm256_permute2f128_ps` calls. In my library, the intrinsics are wrapped so that the register type can easily be swapped, and all masks are calculated automatically. So I made some mistakes when I replaced all the library calls for this post. Here is the correct code:

void Transpose7x4(__m256 in0, __m256 in1, __m256 in2, __m256 in3, __m256& out0, __m256& out1, __m256& out2,
                    __m256& out3, __m256& out4, __m256& out5, __m256& out6)
{
    __m256 tout0, tout1, tout2, tout3, tout4, tout5, tout6;
    __m256 tmp0, tmp1, tmp2, tmp3;


    __m256 tmp4 = _mm256_unpacklo_ps(in3, in0);
    __m256 tmp5 = _mm256_unpackhi_ps(in3, in0);
    __m256 tmp6 = _mm256_unpacklo_ps(in1, in2);
    __m256 tmp7 = _mm256_unpackhi_ps(in1, in2);

    // operand order of the 0xee shuffles is fixed compared to the code in the question
    tmp0 = _mm256_shuffle_ps(tmp4, tmp6, 0x44);
    tmp1 = _mm256_shuffle_ps(tmp4, tmp6, 0xee);
    tmp2 = _mm256_shuffle_ps(tmp5, tmp7, 0x44);
    tmp3 = _mm256_shuffle_ps(tmp5, tmp7, 0xee);


    // 0x00: lower 128-bit lane into both halves of the result
    tout0 = _mm256_permute2f128_ps(tmp0, tmp0, 0x00);
    tout1 = _mm256_permute2f128_ps(tmp1, tmp1, 0x00);
    tout2 = _mm256_permute2f128_ps(tmp2, tmp2, 0x00);
    tout3 = _mm256_permute2f128_ps(tmp3, tmp3, 0x00);
    // 0x33: upper 128-bit lane into both halves (the question's 0x44 selected the lower lane, same as 0x00)
    tout4 = _mm256_permute2f128_ps(tmp0, tmp0, 0x33);
    tout5 = _mm256_permute2f128_ps(tmp1, tmp1, 0x33);
    tout6 = _mm256_permute2f128_ps(tmp2, tmp2, 0x33);


    out0 = tout0;
    out1 = tout1;
    out2 = tout2;
    out3 = tout3;
    out4 = tout4;
    out5 = tout5;
    out6 = tout6;

    //out0 = _mm256_blend_ps(out0, tout0, 0xfe);
    //out1 = _mm256_blend_ps(out1, tout1, 0xfe);
    //out2 = _mm256_blend_ps(out2, tout2, 0xfe);
    //out3 = _mm256_blend_ps(out3, tout3, 0xfe);
    //out4 = _mm256_blend_ps(out4, tout4, 0xfe);
    //out5 = _mm256_blend_ps(out5, tout5, 0xfe);
    //out6 = _mm256_blend_ps(out6, tout6, 0xfe);
}

Now all instructions (8 in-lane shuffles and 7 inter-lane shuffles) show up in the assembly as expected:

godbolt
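
For anyone wondering why the compiler could merge three of the lane permutes in the original (buggy) code: with identical source operands, mask 0x44 of `_mm256_permute2f128_ps` selects the lower 128-bit lane for both halves of the result, exactly like mask 0x00, so tout4..tout6 were duplicates of tout0..tout2. Mask 0x33 selects the upper lane for both halves, which is what the transpose actually needs. A minimal standalone sketch (just an illustration, not part of the benchmark) demonstrating this:

#include <x86intrin.h>
#include <cstdio>

int main()
{
    __m256 v = _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7);

    __m256 a = _mm256_permute2f128_ps(v, v, 0x00); // lower lane into both halves
    __m256 b = _mm256_permute2f128_ps(v, v, 0x44); // bits [1:0] and [5:4] are 0 as well -> same result as 0x00
    __m256 c = _mm256_permute2f128_ps(v, v, 0x33); // upper lane into both halves (what the fixed code uses)

    float fa[8], fb[8], fc[8];
    _mm256_storeu_ps(fa, a);
    _mm256_storeu_ps(fb, b);
    _mm256_storeu_ps(fc, c);

    // columns a and b are identical (0 1 2 3 0 1 2 3), column c is 4 5 6 7 4 5 6 7
    for (int i = 0; i < 8; ++i)
        std::printf("%g %g %g\n", fa[i], fb[i], fc[i]);
}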
