How to specify clobbered bottom of the x87 FPU stack with extended gcc assembly?

Question

In a codebase of ours I found this snippet for fast, towards-negative-infinity¹ rounding on x87:

inline int my_int(double x)
{
  int r;
#ifdef _GCC_
  asm ("fldl %1\n"
       "fistpl %0\n"
       :"=m"(r)
       :"m"(x));
#else
  // ...
#endif
  return r;
}

I'm not extremely familiar with GCC extended assembly syntax, but from what I gather from the documentation:

r must be a memory location, where the result is written back;
x must be a memory location too, from which the data is read;
there's no clobber specification, so the compiler can assume that at the end of the snippet the registers are as it left them.

Now, to come to my question: it's true that in the end the FPU stack is balanced, but what if all 8 slots were already in use and the fldl overflows it? How can the compiler know that it cannot trust ST(7) to be where it left it? Should some clobber be added?

Edit: I tried to specify st(7) in the clobber list and it seems to affect the codegen; I'll wait for some confirmation of this.
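A minimal sketch of what that edit describes, not a confirmed fix: the original snippet with "st(7)" added to the clobber list, so the compiler is told that the slot the fldl pushes into cannot hold a live value. The name my_int_scratch and the fallback branch for non-x86 targets are assumptions made for illustration; __volatile__ is added because the result depends on the rounding mode, which the compiler cannot see.

```c
/* Hypothetical variant of my_int: same instructions, plus a scratch-slot clobber. */
static inline int my_int_scratch(double x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int r;
    __asm__ __volatile__ ("fldl %1\n\t"
                          "fistpl %0"
                          : "=m"(r)     /* fistpl stores a 32-bit int to memory */
                          : "m"(x)      /* fldl loads the double from memory */
                          : "st(7)");   /* the push needs a free slot: declare st(7) dead */
    return r;
#else
    /* Fallback for other targets (assumption): plain floor, i.e. the
       rounding mode the question says the application always uses. */
    int r = (int)x;                     /* truncates toward zero */
    return (x < r) ? r - 1 : r;         /* step down for negative non-integers */
#endif
}
```

The asm rounds according to the current x87 rounding mode, so the result only equals floor(x) when that mode is "towards negative infinity".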

As a side note: looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like

__asm__ __volatile__ ("fistpl %0"
                      : "=m" (retval)
                      : "t" (x)
                      : "st");

where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl); what is that "st" clobber? The docs seem to mention only t (i.e. the top of the stack).

¹ Yes, it depends on the current rounding mode, which in our application should always be "towards negative infinity".

@MichaelPetch: ok that clears the last bit up, unfortunately, the documentation about this stuff is a bit difficult to approach "from the outside", especially the platform-specific parts. — Matteo Italia, Sep 27 '16 at 15:25
It could have been written without the clobber `st` by using `fistl` instead of `fistpl` but it would have required GCC to emit an additional instruction to pop off the value pushed by the `"t"` input constraint. GCC would usually (not always) have to do something like `fstp %st(0)` when your template is complete to remove what it pushed upon entry to your template. By having the assembler code in the template pop the value on the top of the stack with `fistpl` and listing `st` as a clobber it means that the template doesn't need to add an additional instruction to end the template. — Michael Petch, Sep 27 '16 at 21:23
It is bad enough that inline assembler is very tricky to understand and get correct (use it only when absolutely necessary). It is even worse when you deal with the peculiarities of how extended inline assembly interacts with the x87 FPU stack. With x86-64 you can get around this mess since you can use SIMD instructions and avoid for the most part interacting with the x87 FPU. — Michael Petch, Sep 27 '16 at 21:27
@MichaelPetch: yep, unfortunately this is a 32-bit application with 16 years of history. Anyhow, I think that it's now definitely safe to enable at least SSE2 for it, which should free us from some of these problems. — Matteo Italia, Sep 29 '16 at 17:44
You should probably get rid of that inline asm, and let gcc emit it for you from C that has the desired semantics. You *definitely* don't want to require the input to be in memory, because that's just shooting yourself in the foot if it's the result of a calculation (so it will already be at the top of the FP stack). Letting gcc emit the code will let it choose whether to `fist` or `fistp`, if it still wants the FP value or not. — Peter Cordes, Oct 01 '16 at 09:34
And BTW, I don't think that `lrint` definition is actually used by newer gcc. Even with `-m32`, it can inline to an SSE instruction. [`lrint` inlines to a `cvtsd2si eax, xmm0` with just `-fno-math-errno`](https://godbolt.org/g/c4tlfC), even in 32-bit code (with -mfpmath=sse). With -ffast-math, nearbyint also inlines. — Peter Cordes, Oct 01 '16 at 09:47
@PeterCordes: actually, that piece of code caught my attention when, doing some post-mortem debugging, I noticed a straight `fstp`/`fldl` sequence, which made me question if gcc had gone mad. :-) It must be noted however that that piece of code comes a long way, it's the "straight port" in gcc/MinGW of what was used in VC++ 6 (where inline assembly isn't nearly as powerful), and that code itself came from an even earlier project, where it was written because of the deleterious performance of the infamous straight `int` cast on x86. — Matteo Italia, Oct 01 '16 at 12:54
Heh, yeah I got the impression from your question that the code had dubious origins. :P But then you were asking how to make it work, not how to replace it with something that would reliably compile to good code. So I felt the need to point out that that's possible, at least if you can use `-fno-math-errno`. — Peter Cordes, Oct 01 '16 at 12:57
Currently I just replaced the whole thing with the second snippet; as I said before, now we are going to discuss if it may be a problem to switch on at least SSE2 (IMO it shouldn't, however it should be noted that this is an embedded application which probably must still run on older embedded PCs with cheapo Celerons that were shipped a decade ago). `-fno-math-errno` is a thing I am considering, I don't think anybody in our codebase has expectations about errno; `-ffast-math` instead is too dangerous, we do have some quite fragile algorithms around that needed `-ffloat-store` with recent gcc. — Matteo Italia, Oct 01 '16 at 13:00
@PeterCordes: yep, thank you, these are all valid suggestions that I'm going to try out, if I can manage to get better performance *and* remove inline assembly it's always a win :-) . `-fno-math-errno` is seriously tempting, it always makes me cry when I look at the generated code and see all those `call` to libc for stuff that would be a single straight assembly instruction. — Matteo Italia, Oct 01 '16 at 13:04
@MatteoItalia: If your code never reads `errno` after math functions, it should cause literally no change in your numerical results, since you're using `-ffloat-store` to force rounding to 64-bit anyway (which would happen during arg passing). It doesn't enable any "unsafe" math optimizations. — Peter Cordes, Oct 01 '16 at 13:10
Also, the most recent CPUs to *not* have SSE2 are AMD Athlon XP (generation before 64-bit K8). The most recent Intel CPU to *not* have SSE2 is PIII. I think even Celeron P4 CPUs had SSE2. `double` with SSE2 are always 64-bit, so you get the effect of `-ffloat-store` for free. Again, I'd guess you'll get literally identical results, unless there was still some 80-bit temporary that `-ffloat-store` missed. BTW, `-mfpmath=sse` doesn't change the ABI, so `double` args are still returned in `st(0)` :/ I think there's another option to change the ABI, at least for internal linkage functions. — Peter Cordes, Oct 01 '16 at 13:14

score 6 · Accepted Answer · answered May 26 '17 at 09:09

looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like
__asm__ __volatile__ ("fistpl %0"
                     : "=m" (retval)
                     : "t" (x)
                     : "st");
where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl)

This is actually the correct way to represent the code you want as inline assembly.

To get the best possible code generated, you want to make use of the inputs and outputs. Rather than hard-coding the necessary load/store instructions, let the compiler generate them. Not only does this introduce the possibility of eliding potentially unnecessary instructions, it also means that the compiler can better schedule these instructions when they are required (that is, it can interleave the instruction within a prior sequence of code, often minimizing its cost).

what is that "st" clobber? The docs seem to mention only t (i.e. the top of the stack).

The "st" clobber refers to the st(0) register, i.e., the top of the x87 FPU stack. What Intel/MASM notation calls st(0), AT&T/GAS notation generally refers to as simply st. And, as per GCC's documentation for clobbers, the items in the clobber list are "either register names or the special clobbers" ("cc" (condition codes/flags) and "memory"). So this just means that the inline assembly clobbers (overwrites) the st(0) register. The reason why this clobber is necessary is that the fistpl instruction pops the top of the stack, thus clobbering the original contents of st(0).

The only thing that concerns me regarding this code is the following paragraph from the documentation:

Clobber descriptions may not in any way overlap with an input or output operand. For example, you may not have an operand describing a register class with one member when listing that register in the clobber list. Variables declared to live in specific registers (see Explicit Register Variables) and used as asm input or output operands must have no part mentioned in the clobber description. In particular, there is no way to specify that input operands get modified without also specifying them as output operands.

When the compiler selects which registers to use to represent input and output operands, it does not use any of the clobbered registers. As a result, clobbered registers are available for any use in the assembler code.

As you already know, the t constraint means the top of the x87 FPU stack. The problem is, this is the same as the st register, and the documentation very clearly says that we cannot have a clobber that specifies the same register as one of the input/output operands. Furthermore, since the documentation states that the compiler is forbidden to use any of the clobbered registers to represent input/output operands, this inline assembly makes an impossible request: load this value at the top of the x87 FPU stack without putting it in st!

Now, I would assume that the authors of glibc know what they are doing and are more familiar with the compiler's implementation of inline assembly than you or I, so this code is probably legal and legitimate.

Actually, it seems that the unusual case of the x87's stack-like registers forces an exception to the normal interactions between clobbers and operands. The official documentation says:

On x86 targets, there are several rules on the usage of stack-like registers in the operands of an asm. These rules apply only to the operands that are stack-like registers:

Given a set of input registers that die in an asm, it is necessary to know which are implicitly popped by the asm, and which must be explicitly popped by GCC.

An input register that is implicitly popped by the asm must be explicitly clobbered, unless it is constrained to match an output operand.

That fits our case exactly.
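As a sketch of how that rule applies here, the lrint-style statement can be wrapped in a complete function. The name my_int_top and the fallback branch for non-x86 targets are assumptions for illustration; the asm statement itself is the one quoted from glibc/MinGW above.

```c
/* Hypothetical wrapper: the input arrives in st(0) and is popped by the asm. */
static inline int my_int_top(double x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int r;
    __asm__ __volatile__ ("fistpl %0"
                          : "=m"(r)   /* 32-bit integer result, stored to memory */
                          : "t"(x)    /* the compiler places x in st(0) */
                          : "st");    /* fistpl pops st(0), so it must be clobbered */
    return r;
#else
    /* Fallback for other targets (assumption): plain floor, matching the
       rounding mode the question says is always in effect. */
    int r = (int)x;                   /* truncates toward zero */
    return (x < r) ? r - 1 : r;       /* step down for negative non-integers */
#endif
}
```

Because the input is an operand rather than a hard-coded fldl, the compiler is free to skip the load when x is already at the top of the stack.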

Further confirmation is provided by an example appearing in the official documentation (bottom of the linked section):

This asm takes two inputs, which are popped by the fyl2xp1 opcode, and replaces them with one output. The st(1) clobber is necessary for the compiler to know that fyl2xp1 pops both inputs.
asm ("fyl2xp1" : "=t" (result) : "0" (x), "u" (y) : "st(1)");

Here, the clobber st(1) names the same register as the input constraint u, which seems to violate the above-quoted documentation regarding clobbers. It is used and justified for precisely the same reason that "st" is the clobber in the lrint snippet: the instruction pops that input.
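For reference, a sketch of the documentation's statement wrapped in a function. The name ylog2xp1 and the libm-based fallback for non-x86 targets are assumptions; note that fyl2xp1 itself is only specified for x in roughly the range -0.29 to 0.29.

```c
#include <math.h>   /* log1p, log: used only by the non-x86 fallback */

/* Hypothetical wrapper: returns y * log2(x + 1). */
static inline double ylog2xp1(double x, double y)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    double result;
    __asm__ ("fyl2xp1"
             : "=t"(result)   /* the result is left in st(0) */
             : "0"(x),        /* x shares st(0) with the output */
               "u"(y)         /* y is placed in st(1) */
             : "st(1)");      /* two inputs, one output: st(1) is popped */
    return result;
#else
    return y * log1p(x) / log(2.0);   /* portable fallback (assumption) */
#endif
}
```

The "0" matching constraint covers the input that is replaced by the output, so only the input that disappears, st(1), needs a clobber.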

All of that said, and now that you know how to correctly write the code in inline assembly, I have to echo previous commenters who suggested that the best solution would be not to use inline assembly at all. Just call lrint, which not only has the exact semantics that you want, but can also be better optimized by the compiler under certain circumstances (e.g., transforming it into a single cvtsd2si instruction when the target architecture supports SSE).

Thank you for taking the time to give a proper answer to my old-ish question :-) ; the bit about input registers that are also in clobber list seems to be the key, I'll try later to figure out some different cases (`fchs` => input in st0, output in st0, no clobber; `fdivp` => input in st0 and st1, output in st0, pops st1; and most importantly, `fxtract` => input in st0, output in st0 *and then* push of another bit of output) and see if I'm still missing something. — Matteo Italia, May 26 '17 at 09:53
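The simplest of the cases listed in that comment (`fchs`: input in st0, output in st0, nothing popped) could be sketched as follows. The name x87_negate and the fallback branch for non-x86 targets are assumptions for illustration.

```c
/* Hypothetical wrapper around fchs, which flips the sign of st(0) in place. */
static inline double x87_negate(double x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    double r;
    __asm__ ("fchs"
             : "=t"(r)    /* the output is read from st(0) */
             : "0"(x));   /* the input is tied to the same register, so no clobber */
    return r;
#else
    return -x;            /* portable fallback (assumption) */
#endif
}
```

Since the input is constrained to match the output, the "must be explicitly clobbered" rule quoted in the answer does not apply and the clobber list stays empty.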
