I am currently working on a program that processes large amounts of data in a tight loop. Blocks of data are loaded into YMM registers, from which 64-bit chunks are extracted to be actually worked on.

This loop is one of several, which the program switches between depending on the exact content of the data being processed. As such, each loop must be occasionally interrupted (sometimes frequently) in order to perform said switching. To make the whole system a bit easier to work on, each loop is contained within its own function.

A fairly major annoyance I've run into (not for the first time) is that it is quite difficult to preserve the 256-bit and 64-bit chunks across the function calls. Each loop processes the same data, so it doesn't make sense to discard these registers when one loop breaks, only to immediately load the exact same data back in. This doesn't cause any major performance problems, but it is measurable, and just seems overall silly.

I've tried about a million different things, with not a single one giving me a proper solution. Of course, I could simply store the chunks within the outer switching loop, and pass them to the inner loops as references, but a quick check of the generated assembly shows that both GCC and Clang revert to pointers no matter what I try, defeating the entire point of the optimization.
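
For illustration, here's roughly the shape of that attempt (the names are placeholders, not my real loops):

#include <immintrin.h>
#include <cstdint>

// Hypothetical inner loop: works on the live chunks, returns which loop to run next.
int inner_loop(__m256i &block, uint64_t &chunk);

void outer_switch(const void *data) {
    __m256i block = _mm256_loadu_si256(static_cast<const __m256i *>(data));
    uint64_t chunk = _mm256_extract_epi64(block, 0);
    int which = 0;
    while (which >= 0) {
        // Both GCC and Clang lower these references to pointers,
        // spilling the registers to memory around every call.
        which = inner_loop(block, chunk);
    }
}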

I could also just mark each loop as always_inline, turn on LTO, and call it a day, but I plan on adding a hand-written assembly version of one of the loops, and I don't want to be forced to write it inline. Really what I'd like is for the function's declaration to simply signal to callers that the vectors (and associated information) will be passed out of the function as return values, in proper registers, allowing me to reduce the overhead (without inlining) to at most a few register/register movs.

The closest thing I've found is the vectorcall calling convention, supported by MSVC, and at least partially by Clang and GCC.

For reference, I am currently using GCC, but would be willing to switch to Clang if it has a solution to this. If MSVC is the only compiler capable, I'll just go with the inlining option.

I created this simple example:

#include <immintrin.h>

struct HVA4 {
    __m256i data[4];
};

HVA4 __vectorcall example(HVA4 x) {
    x.data[0] = _mm256_permute4x64_epi64(x.data[0], 0b11001001);
    x.data[2] = _mm256_permute4x64_epi64(x.data[2], 0b00111001);

    return x;
}

which compiles to

vpermq  ymm0, ymm0, 201
vpermq  ymm2, ymm2, 57
ret

under MSVC 19.35 using /O2 /GS- /arch:AVX2.

This is actually exactly what I want: my vector parameters are passed in proper SIMD registers, and are returned as such. The registers used even line up! From reading the MSDN docs, it sounds like I should be able to extend this to non-homogeneous aggregates as well, though even if not, I can make this work.

Clang is another story, however. On 16.0.0, using -O3 -mavx2, it generates this absolute mess:

mov     rax, rcx
vpermpd ymm0, ymmword ptr [rdx], 201
vmovaps ymmword ptr [rdx], ymm0
vpermpd ymm0, ymmword ptr [rdx + 64], 57
vmovaps ymmword ptr [rdx + 64], ymm0
vmovaps ymm0, ymmword ptr [rdx + 32]
vmovaps ymm1, ymmword ptr [rdx + 96]
vmovaps ymmword ptr [rcx + 96], ymm1
vmovaps ymmword ptr [rcx + 32], ymm0
vmovaps ymm0, ymmword ptr [rdx + 64]
vmovaps ymmword ptr [rcx + 64], ymm0
vmovaps ymm0, ymmword ptr [rdx]
vmovaps ymmword ptr [rcx], ymm0
vzeroupper
ret

I'd show GCC's attempt, but it would probably double the size of this question.

The general idea is the same, however; both GCC and Clang completely refuse to use multiple registers for SIMD return values, and only sometimes do so for parameters (they fare a lot better if the vectors are removed from the struct). While this may be expected behavior for standard calling conventions (I suspect they're actually following the SysV ABI, at least for return value placement), vectorcall explicitly allows for it.

Of course, vectorcall is a non-standard attribute, and just because two compilers have the same name for something doesn't mean they do the same thing, but Clang at least specifically links to the MSDN docs, so I'd expect it to follow them.

Is this simply a bug in clang? Just an unimplemented feature? (Again, it does link to the MSDN docs)

Furthermore, is there any way to achieve the optimizations given by MSVC in code like the example above, in either GCC or Clang, be it via a calling convention, or some compiler specific flag? I'd be happy to try writing a custom convention into the compiler, but that's pretty heavily out of scope for this project.


1 Answer


All the YMM registers are call-clobbered, so a non-inline function is kind of a showstopper for keeping any significant amount of data in registers. (The Windows x64 convention has call-preserved xmm6..15, but the wider YMM registers are still clobbered.) Quite a few integer registers are also call-clobbered, especially in the x86-64 System V calling convention (non-Windows).

If your program's valuable state is only those 4 vectors and a few integer registers, then yes, MSVC's x64 vectorcall can pass the vectors to non-inline functions and have them all returned as return values.

Otherwise, other state will have to get spilled/reloaded around the call, so the only good option for hand-written asm is GNU C inline asm.


x86-64 SysV returns 1 vector in x/y/zmm0

The x86-64 System V calling convention can return a value in at most 2 vector registers (xmm/ymm/zmm), similar to how integer args can be passed in up to 6 regs but returned only in RDX:RAX.

But XMM1 is only used when returning an aggregate of scalar float or double (with a total size not exceeding 16 bytes, so the return value fits in the low eightbyte of each of XMM0 and XMM1). The ABI doc's classification rule 5 (c) says: "If the size of the aggregate exceeds two eightbytes and the first eightbyte isn't SSE or any other eightbyte isn't SSEUP, the whole argument is passed in memory." A second __m128i vector in a struct has a second SSE-classed eightbyte, which is why such a struct returns in memory rather than in XMM0, XMM1. Rule 5c only allows returning in YMM0 or ZMM0 for a single vector wider than 16 bytes (where all the eightbytes after the first are SSEUP), not for other cases.

Testing confirms this. With struct { __m256i v[2]; }, GCC/clang return it in memory, not YMM0 / YMM1; see the Godbolt link below. But with struct { float v[4]; } we see v[3] being returned in element 1 of XMM1 (the top half of the low 64 bits = the second eightbyte): Godbolt
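
For concreteness, a minimal repro of those rules (roughly what the Godbolt link tests; compile for x86-64 SysV with -mavx2):

#include <immintrin.h>

struct TwoVecs { __m256i v[2]; };  // eightbyte 4 (start of v[1]) is SSE, not SSEUP
struct Floats  { float v[4]; };    // two eightbytes, both classed SSE

// Rule 5c disqualifies this: returned in memory via hidden pointer.
TwoVecs ret_two(TwoVecs x)  { return x; }

// Two SSE eightbytes: returned in XMM0 (v[0], v[1]) and XMM1 (v[2], v[3]).
Floats ret_floats(Floats x) { return x; }

// One SSE eightbyte followed by three SSEUP: returned in YMM0.
__m256i ret_one(__m256i x)  { return x; }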

So the AMD64 System V ABI's calling convention is not suited for your use case, even if it could return 2 vectors in vector regs.


vectorcall in GCC or clang: Different from MSVC, only 1 vector reg

You could declare a prototype for your asm function with __attribute__((ms_abi)) (gcc or clang) or __attribute__((vectorcall)) (clang only), but that doesn't actually seem to work the way you describe MSVC working: a struct of more than one __m256i gets returned in memory, by hidden pointer, even with vectorcall. (Godbolt)
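
For reference, attaching those attributes to a prototype looks like this (asm_loop_ms / asm_loop_vc are placeholder names); either way, the multi-vector struct return still goes through the hidden pointer:

#include <immintrin.h>

struct HVA4 { __m256i data[4]; };

__attribute__((ms_abi)) HVA4 asm_loop_ms(HVA4 x);      // gcc or clang
#ifdef __clang__
__attribute__((vectorcall)) HVA4 asm_loop_vc(HVA4 x);  // clang only
#endif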

A comment from Agner Fog on a GCC bug report (89485) says that clang targeting Windows does support __vectorcall, but that bug was just requesting GCC support for it at all, not discussing whether it returned multiple vectors in registers. Perhaps clang's implementation of __vectorcall isn't ABI-compatible with MSVC's for struct returns of multiple vectors?

I don't have Windows clang available for testing, or clang-cl which aims for more compat with MSVC.


asm("call foo" : "+x"(v0), ...); wrapper to also not clobber other regs

As you suggested in comments, you could invent your own calling convention and describe it to the compiler via inline asm. As long as it's a pure function, you can even avoid a "memory" clobber.

You do need to stop the compiler from using the red zone in the caller because call pushes a return address. See Inline assembly that clobbers the red zone

The compiler won't know it's a function call at all; the fact that your inline asm template happens to push/pop something on the stack is the important part, not that it jumps somewhere else before execution comes out the other side. The compiler doesn't parse the asm template string except to substitute %operands, like printf. It doesn't care if you reference an operand explicitly or not.

So you still have all the benefits and all the downsides of inline asm (https://gcc.gnu.org/wiki/DontUseInlineAsm), including having to precisely describe the outputs : inputs : clobbers to the compiler for the block of code you're running, like how you'd document in comments for hand-written asm helper functions.

Plus the overhead of a call and ret vs. writing your asm inside the asm statement itself. This seems very bad for something as cheap as two vpermq instructions. You could perhaps use asm(".include \"helper.s\"" : "+x"(v0), ...); if you can split up your helpers one per file, as sketched below. (Or perhaps .set something that a .if can check for, so you can ask for one block out of a file with multiple blocks? But that's probably harder to maintain.)
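
That variant could look like this, with the same operands and register-variable setup as the full example further down (helper.s is a hypothetical file holding just the instruction body, no label or ret):

// The helper's instructions are pasted inline at assembly time:
// no call/ret overhead, and no return address pushed on the stack.
asm(".include \"helper.s\""
    : "+x"(v0), "+x"(v1), "+x"(v2), "+x"(v3)
    : : "rax", "rcx", "rdx");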

If you were using any "m" operands that might pick an addressing mode relative to RSP, that could also break as call pushes a return address. But you won't be in this case; you'll be forcing the compiler to pick specific registers for the operands instead of even giving it the choice of which YMM register to pick.

So it could perhaps look something like

#include <immintrin.h>

auto bar(__m256i v0_in, __m256i v1_in, __m256i v2_in, __m256i v3_in) {
    // clang does pass args in the right regs for vectorcall
    // (after taking into account that the first arg-reg slot is taken by the
    // hidden pointer, because of the disagreement about aggregate returns)
    register __m256i v0 asm("ymm0") = v0_in;  // force "x" constraints to pick a certain register for asm statements
    register __m256i v1 asm("ymm1") = v1_in;
    register __m256i v2 asm("ymm2") = v2_in;
    register __m256i v3 asm("ymm3") = v3_in;

    v1 = _mm256_add_epi64(v1, v3);              // do something with the incoming args, just for example
    __m256i vlocal = _mm256_add_epi64(v0, v2);  // compiler can allocate this anywhere

    // declare some integer register clobbers if your function needs any;
    // the fewer the better, so the compiler can keep its own stuff in those regs
    asm("call asm_foo" : "+x"(v0), "+x"(v1), "+x"(v2), "+x"(v3) : : "rax", "rcx", "rdx");
    // if you don't compile with -mno-red-zone, then "add $-128, %%rsp ; call ; sub $-128, %%rsp".
    // But you don't want that around each call inside a loop, so just use -mno-red-zone.
    return _mm256_add_epi64(vlocal, v2);
}

Godbolt gcc and clang compile this to:

# clang16 -O3 -march=skylake -mno-red-zone

bar(long long __vector(4), long long __vector(4), long long __vector(4), long long __vector(4)):
        vpaddq  ymm1, ymm3, ymm1
        vpaddq  ymm4, ymm2, ymm0      # compiler happened to pick ymm4 for vlocal, a reg not clobbered by the asm statement.
# inline asm starts here
        call    asm_foo
# inline asm ends here
  # if we just return v2, we get  vmovaps ymm0, ymm2
        vpaddq  ymm0, ymm4, ymm2     # use ymm4 which was *not* clobbered by the inline asm statement,
                                     # along with the v2 = ymm2 output of the asm

        ret

vs. GCC being bad as usual at dealing with hard-register constraints on its register allocation:

# gcc13 -O3 -march=skylake -mno-red-zone

bar(long long __vector(4), long long __vector(4), long long __vector(4), long long __vector(4)):
        vmovdqa ymm5, ymm2      # useless copies, silly compiler.
        vmovdqa ymm4, ymm0
        vpaddq  ymm1, ymm1, ymm3
        vpaddq  ymm4, ymm4, ymm5
        call asm_foo
        vpaddq  ymm0, ymm4, ymm2
        ret

Whatever you were going to do in the asm_foo function, you could just as well have done it inside the asm template. And then you could use %0 instead of %%ymm0 to give the compiler a choice of registers. I lined up the variables with the incoming args to make it easy for the compilers.

asm_foo is the function that has the special calling convention. bar() is just a normal function whose callers will assume it clobbers all the vector regs and half the integer regs, and that it can only return one vector by value.
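
For completeness, asm_foo's separate file could contain as little as the following (hypothetical body matching the question's two shuffles; shown as a top-level asm block here so the example is self-contained, but normally it would live in its own .s file):

// Invented convention: ymm0..ymm3 are inputs and outputs,
// rax/rcx/rdx are scratch, nothing else is touched. (AT&T syntax.)
asm(R"(
    .globl asm_foo
asm_foo:
    vpermq  $201, %ymm0, %ymm0
    vpermq  $57,  %ymm2, %ymm2
    ret
)");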

  • Thanks for the response. I also ran into the exact same trouble attempting to have even just 2 vectors returned in registers. My only guess is that the 5.c clause is causing problems, as I *think* the first eightbyte of the second YMM would be classified as SSE, not SSEUP? I wouldn't quote myself on that though, and still doesn't explain why two XMM regs don't work. Clang just having a non ABI-compatible impl seems most likely. I'll test on windows later just to be sure though, and will also read through the call convention docs more. If only references could just be passed in registers :/ – the4naves May 13 '23 at 04:01
  • On a side note, I suppose I could *maybe* just essentially make up my own calling convention by creating an asm macro that calls the actual function, and then having an asm block at the beginning of the function that just makes my parameters visible to further code. I'd be a little worried about the compiler making assumptions based on the calling convention it believes it to be, though. – the4naves May 13 '23 at 04:11
  • @the4naves: Yes, the first eightbyte of each `__m256i` should be classed as SSE, so it picks the next register from the set `ymm0, ymm1`. If that perhaps isn't happening, like if the struct is treated as one homogeneous thing that doesn't fit in a single register, then maybe that's the problem? But then IDK when you ever would be able to return something in X/YMM0 and X/YMM1. – Peter Cordes May 13 '23 at 04:22
  • 1
    @the4naves: Rolling your own calling convention and describing it to the compiler with an `asm("" : "+x"(v0), ... )` statement could work with `register __m256i v0 asm("ymm0")`, at least for pure helper functions that don't care about incoming stack alignment. But beware that there's [no way to tell the compiler about `call` clobbering the red zone](//stackoverflow.com/q/6380992) (below RSP), so you either need to `sub rsp, 128` / `call` / `add rsp, 128` (or `add rsp, -128` for a shorter imm8), or compile that file with `-mno-red-zone`. (Fortunately that doesn't change ABI compatibility.) – Peter Cordes May 13 '23 at 04:32
  • @the4naves: GCC and clang appear to agree with each other, and disagree with the AMD64 SysV ABI doc. Playing around on Godbolt, I didn't find any differences between them (except of course that GCC ignores `__attribute__((vectorcall))`). The only ABI incompatibility I found between two real compilers was between clang's `vectorcall` and MSVC's `vectorcall`. – Peter Cordes May 13 '23 at 04:33
  • As for the previous bit about eightbyte classification, 5.c states: "If the size of the aggregate exceeds two eightbytes and the first eightbyte isn’t SSE or any other eightbyte isn’t SSEUP, the whole argument is passed in memory". Would this not mean that two YMMs would trigger the "or any other eightbyte isn’t SSEUP" bit, or am I just missing how that clause is actually used? I'll probably try the assembly call just to see if there's any other catches I'm missing, as otherwise that would be a reasonable solution; thanks on the note about the red zone. – the4naves May 13 '23 at 04:47
  • @the4naves: The `asm("call":...)` trick means the code you write in a separate `.S` or `.asm` file *is* effectively still inline asm, it's just part of an asm template that pushes and pops and includes 2 jumps. You're wasting uops inside your hot loop by doing that instead of just writing it actually inline or using `.include` in the asm template. But as long as you accurately describe its inputs / outputs / clobbers to the compiler, the compiler doesn't know or care that it's a "function call". Of course with `-mno-red-zone` and no `"m"` operands that might use an RSP-relative addressing mode. – Peter Cordes May 13 '23 at 04:56
  • @the4naves: Hrm, well spotted about 5.c. So aggregates of two vectors are disqualified from register passing because they have two eightbytes that are SSE, not SSEUP. But a `struct {double v[2];}` can return in the low elements of XMM0 and XMM1 - https://godbolt.org/z/7Tb1a44nc . (GCC has a silly missed optimization where it returns `v[1]` with `vmovsd xmm0, xmm1, xmm1` instead of clang's `vmovaps xmm0, xmm1`.) Or https://godbolt.org/z/9oz8Gb9he shows returning a struct of `float v[4]` in the low 64 bits each of XMM0 and XMM1 (two eightbytes). – Peter Cordes May 13 '23 at 05:04
  • @the4naves: updated my answer with the stuff we discussed in comments. – Peter Cordes May 13 '23 at 05:42
  • 1
    I've finally gotten around to testing on windows. GCC 11.2 doesn't even recognize `vectorcall`. However clang 15.0.7 with `target=x86_64-w64-windows-gnu` (to make it compile at all) works *perfectly*, generating the exact same assembly MSVC did in the above example! Unfortunately, I can't try cross compiling to anything as I don't have the relevant headers (or do but don't know it). Obviously this still isn't ideal; I'd like to have this performance on as many systems as possible, but it is a start. – the4naves May 14 '23 at 03:29
  • Also a couple notes on my exact use case; the inner loops themselves are the main hot areas (the example 2 shuffles are a big understatement). The outer loop is really just a glorified indirect call that switches between the inner loops whenever the speed of the new loop would likely offset the mispredict penalty. I don't know the exact cost for having a `call` immediately on the new path, but I could just replace the entire outer loop with assembly (a good idea anyway as all the loops should really be using the same call convention), thus avoiding any extra overhead. – the4naves May 14 '23 at 03:30
  • @the4naves: Interesting. Seems like a clang bug then, unless `__attribute__((vectorcall))` on System V targets is *supposed* to mean something different. Do you want to report it? – Peter Cordes May 14 '23 at 03:40
  • 1
    I'll report it yeah. If it's intended behavior it should probably be at least documented. – the4naves May 14 '23 at 03:55
  • 1
    Issue link is https://github.com/llvm/llvm-project/issues/62700. – the4naves May 14 '23 at 04:53
  • 1
    Quick update: Clang's `regcall` attribute appears to work for this situation on both linux+windows (https://godbolt.org/z/EG9P911rj as an example). `regcall` is probably even better for my situation as opposed to `vectorcall`, as it has a simply massive array of registers available for both parameters and return values. To be honest, I'm not quite sure how I missed this, guess I was just too focused on finding something that mentioned 'vector's specifically. – the4naves May 24 '23 at 21:02
  • 1
    `regcall` does still have trouble returning non-homogeneous aggregates, but passes them perfectly fine, which might be workable for my exact situation; I'll have to see. – the4naves May 24 '23 at 21:18