I am currently working on a program that processes large amounts of data in a tight loop. Blocks of data are loaded into YMM registers, from which 64-bit chunks are extracted to be actually worked on.
This loop is one of several, which the program switches between depending on the exact content of the data being processed. As such, each loop must be occasionally interrupted (sometimes frequently) in order to perform said switching. To make the whole system a bit easier to work on, each loop is contained within its own function.
A fairly major annoyance I've run into (not for the first time), is that it is fairly difficult to preserve the 256-bit and 64-bit chunks across the function calls. Each loop processes the same data, so it doesn't make sense to discard these registers when one breaks, only to immediately load the exact same data back in. This doesn't really cause any major performance problems, but it is measurable, and just seems overall silly.
I've tried about a million different things, with not a single one giving me a proper solution. Of course, I could simply store the chunks within the outer switching loop, and pass them to the inner loops as references, but a quick check of the generated assembly shows that both GCC and Clang revert to pointers no matter what I try, defeating the entire point of the optimization.
I could also just mark each loop as always_inline, turn on LTO, and call it a day, but I plan on adding a hand-written assembly version of one of the loops, and I don't want to be forced to write it inline. Really what I'd like is for the function's declaration to simply signal to callers that the vectors (and associated information) will be passed out of the function as return values, in proper registers, allowing me to reduce the overhead (without inlining) to at most a few register/register mov
s.
The closest thing I've found is the vectorcall
calling convention, supported by MSVC, and at least partially by Clang and GCC.
For reference, I am currently using GCC, but would be willing to switch to Clang if it has a solution to this. If MSVC is the only compiler capable, I'll just go with the inlining option.
I created this simple example:
#include <immintrin.h>
struct HVA4 {
__m256i data[4];
};
HVA4 __vectorcall example(HVA4 x) {
x.data[0] = _mm256_permute4x64_epi64(x.data[0], 0b11001001);
x.data[2] = _mm256_permute4x64_epi64(x.data[2], 0b00111001);
return x;
}
which compiles to
vpermq ymm0, ymm0, 201
vpermq ymm2, ymm2, 57
ret
under MSVC 19.35 using /O2 /GS- /arch:avx2
.
This is actually exactly what I want: my vector parameters are passed in proper SIMD registers, and are returned as such. The registers used even line up! From reading the MSDN docs , it sounds like I should be able to extend this to non-homogeneous aggregates as well, though even if not, I can make this work.
Clang is another story however. On 16.0.0 using -O3 -mavx2
it generates this absolute mess:
mov rax, rcx
vpermpd ymm0, ymmword ptr [rdx], 201
vmovaps ymmword ptr [rdx], ymm0
vpermpd ymm0, ymmword ptr [rdx + 64]
vmovaps ymmword ptr [rdx + 64], ymm0
vmovaps ymm0, ymmword ptr [rdx + 32]
vmovaps ymm1, ymmword ptr [rdx + 96]
vmovaps ymmword ptr [rcx + 96], ymm1
vmovaps ymmword ptr [rcx + 32], ymm0
vmovaps ymm0, ymmword ptr [rdx + 64]
vmovaps ymmword ptr [rcx + 64], ymm0
vmovaps ymm0, ymmword ptr [rdx]
vmovaps ymmword ptr [rcx], ymm0
vzeroupper
ret
I'd show GCC's attempt, but it would probably double the size of this question.
The general idea with is the same, however; both GCC and Clang completely refuse to use multiple registers for SIMD return values, and only sometimes do so for parameters (they fare a lot better if the vectors are removed from the struct). While this may be expected behavior for standard calling conventions (I suspect they're actually following the SysV ABI at least for return value placement), vectorcall
explicitly allows for it.
Of course, vectorcall
is a non-standard attribute, just because two compilers have the same name for something doesn't mean they do the same thing, etc, but at least Clang specifically links to the MSDN docs, so I'd expect it follow them.
Is this simply a bug in clang? Just an unimplemented feature? (Again, it does link to the MSDN docs)
Furthermore, is there any way to achieve the optimizations given by MSVC in code like the example above, in either GCC or Clang, be it via a calling convention, or some compiler specific flag? I'd be happy to try writing a custom convention into the compiler, but that's pretty heavily out of scope for this project.