I am writing a small template library to transpose arbitrary matrices using AVX intrinsics. Since I make heavy use of if constexpr
and templates, I wanted to make sure that the compiler applies all the optimizations I expect, so I benchmarked my code. In doing so I came across a result I don't really understand.
The functions have a template parameter that controls how unused register values should be handled. One option is to simply accept whatever ends up there during the performed operations. The other is to write only to the entries that are needed to store the result, leaving the remaining entries of the output registers untouched. I have removed all the template machinery and written a short example for a 7x4 matrix:
EDIT: This code is wrong --- see UPDATE
#include <immintrin.h>

void Transpose7x4(__m256 in0, __m256 in1, __m256 in2, __m256 in3, __m256& out0, __m256& out1, __m256& out2,
                  __m256& out3, __m256& out4, __m256& out5, __m256& out6)
{
__m256 tout0, tout1, tout2, tout3, tout4, tout5, tout6;
__m256 tmp0, tmp1, tmp2, tmp3;
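// Interleave 32-bit elements of the input rows within each 128-bit lane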
__m256 tmp4 = _mm256_unpacklo_ps(in3, in0);
__m256 tmp5 = _mm256_unpackhi_ps(in3, in0);
__m256 tmp6 = _mm256_unpacklo_ps(in1, in2);
__m256 tmp7 = _mm256_unpackhi_ps(in1, in2);
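// Recombine the interleaved halves (0x44 picks the low pairs, 0xee the high pairs)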
tmp0 = _mm256_shuffle_ps(tmp4, tmp6, 0x44);
tmp1 = _mm256_shuffle_ps(tmp6, tmp4, 0xee);
tmp2 = _mm256_shuffle_ps(tmp5, tmp7, 0x44);
tmp3 = _mm256_shuffle_ps(tmp7, tmp5, 0xee);
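// Distribute 128-bit lanes of the intermediate results to the seven output rows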
tout0 = _mm256_permute2f128_ps(tmp0, tmp0, 0x00);
tout1 = _mm256_permute2f128_ps(tmp1, tmp1, 0x00);
tout2 = _mm256_permute2f128_ps(tmp2, tmp2, 0x00);
tout3 = _mm256_permute2f128_ps(tmp3, tmp3, 0x00);
tout4 = _mm256_permute2f128_ps(tmp0, tmp0, 0x44);
tout5 = _mm256_permute2f128_ps(tmp1, tmp1, 0x44);
tout6 = _mm256_permute2f128_ps(tmp2, tmp2, 0x44);
// Don't care what is written to unused values
out0 = tout0;
out1 = tout1;
out2 = tout2;
out3 = tout3;
out4 = tout4;
out5 = tout5;
out6 = tout6;
// Only write to values necessary to store the result
//out0 = _mm256_blend_ps(out0, tout0, 0xfe);
//out1 = _mm256_blend_ps(out1, tout1, 0xfe);
//out2 = _mm256_blend_ps(out2, tout2, 0xfe);
//out3 = _mm256_blend_ps(out3, tout3, 0xfe);
//out4 = _mm256_blend_ps(out4, tout4, 0xfe);
//out5 = _mm256_blend_ps(out5, tout5, 0xfe);
//out6 = _mm256_blend_ps(out6, tout6, 0xfe);
}
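For context, the choice between the two write strategies is made in the real library with a template parameter and if constexpr. The following is only a minimal sketch of that dispatch; the names (OverwriteUnused, StoreRow) are made up for illustration and do not exist in the actual code:

#include <immintrin.h>

enum class OverwriteUnused { Yes, No };

// Illustrative helper: write a computed row to an output register, either
// overwriting it completely or touching only the entries that hold result data.
template <OverwriteUnused Mode>
inline void StoreRow(__m256& out, __m256 result)
{
    if constexpr (Mode == OverwriteUnused::Yes)
        out = result;                              // unused entries get whatever is in 'result'
    else
        out = _mm256_blend_ps(out, result, 0xfe);  // keep element 0 of 'out', take the rest from 'result'
}

The two benchmarked variants above correspond to instantiating this with Yes and No, respectively.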
As you can see, the version that does not overwrite the unused values needs additional blends, so I expected it to be slightly slower. However, the benchmark results (Clang 8.0.0 and GCC 8.3.0 on an Intel Skylake processor) told me otherwise: 100 transpositions took around 430ns with the version that blends, while the other version took around 670ns. I checked the assembly to see whether something weird is happening, but I can't spot anything: godbolt
The assembly is more or less identical; the only difference is that one version has vmovaps
interleaved with the additional vblendps
(and one extra vperm2f128).
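For reference, the timing loop is essentially the following. This is only a simplified sketch using the Transpose7x4 listing from above; my real benchmark goes through the templated functions and takes care that the compiler cannot optimize the loop away:

#include <chrono>
#include <cstdio>

int main()
{
    __m256 in[4], out[7];
    for (int i = 0; i < 4; ++i)
        in[i] = _mm256_set1_ps(static_cast<float>(i));
    for (int i = 0; i < 7; ++i)
        out[i] = _mm256_setzero_ps();

    const auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep)
        Transpose7x4(in[0], in[1], in[2], in[3],
                     out[0], out[1], out[2], out[3], out[4], out[5], out[6]);
    const auto stop = std::chrono::steady_clock::now();

    std::printf("100 transpositions: %lld ns\n",
                static_cast<long long>(
                    std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count()));
    return 0;
}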
I calculated the expected clock cycles, taking the pipelining of _mm256_permute2f128_ps
into account. For the code without the blending I came up with 17 cycles per transpose. Multiplying by 100 and dividing by my processor frequency gives about 425ns, which is pretty much what I measured for the version with blending. The only explanation I can see for why the version without blending takes more time is that the pipelining of _mm256_permute2f128_ps
cannot be utilized for some reason: if I recalculate the expected timing under the assumption that every _mm256_permute2f128_ps
takes its full 3 clock cycles, I get 725ns, which is much closer to what I measure.
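For reference, the arithmetic behind those numbers (the roughly 4 GHz clock is inferred from my own measurements rather than read out separately):

17 cycles x 100 transposes / 4.0 GHz ≈ 425ns
725ns x 4.0 GHz / 100 transposes ≈ 29 cycles per transpose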
So the question is: why is the version with the blends faster (seemingly able to utilize the pipelining) than the "simpler" version, and how can I fix that?