x86 Assembly How to properly get XMM0 into ST0?

Question

A wonderful Sunday everyone.

I am currently learning a lot of assembly in the 32-bit environment (currently Windows). I am using FASM for this.

I have the following code which I successfully made but I'm quite unhappy with the way I load XMM0 into ST0:

GetDistance: ;(__cdecl*)(float x1, float y1, float x2, float y2)
        push    ebp
        mov     ebp, esp        
        sub     esp, 0x4
        
        movss   xmm0, DWORD [ebp + 0x0014] ; Load y2 (4th arg)
        subss   xmm0, DWORD [ebp + 0x000C] ; Subtract y1 (2nd arg)

        movss   xmm1, DWORD [ebp + 0x0010] ; Load x2 (3rd arg)
        subss   xmm1, DWORD [ebp + 0x0008] ; Subtract x1 (1st arg)

        mulss   xmm0, xmm0             ; Square of the y difference
        mulss   xmm1, xmm1             ; Square of the x difference

        addss   xmm0, xmm1             ; Sum of squared differences

        sqrtss  xmm0, xmm0             ; Square root
                                 
        movss   dword [ebp - 0x0004], xmm0
        fld     dword [ebp - 0x0004]    
        
        add     esp, 0x4
        
        pop     ebp
        ret     0

It does work, but I have been googling for two hours straight (I even asked ChatGPT) for how to get my XMM0 value into ST0. I guess I'm failing to search for the right terms, and ChatGPT's answers always created compile errors or made my function return 'NAN'. ChatGPT converted my simple function always to an executable main block which uses .data section and therefore global variables and I think it leads me into a complete wrong direction.

I don't like that I had to use sub from and add to ESP to get XMM0 into ST0.

I would also appreciate any tips to improve my code, or good resources to learn from. I only want to focus on 32-bit for now. :)

You don't need `ebp`, but if you do set up a stack frame you can use `leave` at the end instead of the `add`/`pop`. `fld` unfortunately needs a memory operand so there is no way around that unless of course if you use the fpu instructions instead of sse. — Jester, Aug 06 '23 at 12:45
If you insist on using 32-bit code, then you're stuck with legacy calling-conventions returning `float` in `st0` instead of `xmm0`. You could try `vectorcall`, though, to have the first several args passed in xmm registers, and return in xmm0: https://godbolt.org/z/aozfG5PTh . Or if you don't want to do that, the callee owns its incoming stack args, so you can use `[esp+4]` as scratch space to store/reload. As Jester said, you don't need to waste instructions setting up EBP. — Peter Cordes, Aug 06 '23 at 13:08
Yes, a trip through memory cannot be avoided when transferring data between X87 and SSE registers. — fuz, Aug 06 '23 at 13:25
@Jester I added leave now, thank you. I know I could get rid of the stack frame, but I read somewhere that doing so is considered bad practice? I tried to find where I read that, but unfortunately I cannot. — Zvend, Aug 06 '23 at 13:45
@PeterCordes I had never read about vectorcall before, it's interesting! But I want to stick to cdecl so I don't confuse myself when I want to use the function. I can see the benefit of fewer instructions, but I mainly want to focus on writing solid assembly code and actually learning from it. — Zvend, Aug 06 '23 at 13:45
@fuz Yeah, I figured that out. I was hoping for a neat trick. I think I have to learn to use the st(0) instructions, but when I started on them today they confused me, to be honest, and the syntax for the xmm0 instructions was way simpler (to me). — Zvend, Aug 06 '23 at 13:45
@EricPostpischil I usually agree with that, but when I get stuck finding answers on Google I try asking ChatGPT, and it always makes the same mistake. I always try to ask other people as little as possible. — Zvend, Aug 06 '23 at 13:45
If you don't want to learn the legacy x87 FPU instructions right away, then you could work with 64-bit code where the standard calling convention passes/returns floats in XMM regs. The standard 32-bit calling conventions are old and bad, and use the x87 stack. — Peter Cordes, Aug 06 '23 at 13:59
Generally it's a good idea to work on solving your problem yourself, and to explain what you tried. But with ChatGPT, it's a waste of time to try it, and another waste of time to write in your post that it didn't work, and a third waste of time for others to read that it didn't work. — Nate Eldredge, Aug 06 '23 at 15:51

score 4 · Accepted Answer · edited Aug 10 '23 at 21:36

Store/reload is necessary to transfer from XMM to st0. Even though MMX registers alias the x87 registers, there's no way to use MOVDQ2Q mm0, xmm0 to get an 80-bit FP bit-pattern into st0, even apart from the problem of switching back from MMX to x87 state without clearing the registers.

Related: Intel x86_64 assembly, How to move between x87 and SSE2? (calculating arctangent of double)

You don't need to waste instructions setting up EBP as a frame pointer, though, especially in simple functions like this where it's easy enough to keep track of offsets relative to ESP.

In a function with stack args, the callee (your function) "owns" them, so you can use [esp+4] as scratch space instead of reserving new space. This is why, when calling the same function twice with the same args, the caller has to store the args again. e.g.

square:                   ; float square(float a); legacy cdecl convention
 movss  xmm0, [esp+4]
 mulss  xmm0, xmm0
 movss  [esp+4], xmm0      ; reuse the incoming arg as scratch space
 fld    dword [esp+4]
 ret

In this case it would have been more efficient to use fld dword [esp+4] / fmul st0 / ret because we're using a calling convention that returns in st0.
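
Putting these points together for the question's GetDistance, a minimal sketch in FASM syntax might look like the following. It assumes the cdecl stack layout from the question, so on entry x1 is at [esp+4], y1 at [esp+8], x2 at [esp+12] and y2 at [esp+16]. The second label, GetDistance_x87, is a hypothetical name for an x87-only variant that needs no store/reload at all. Its intermediates are kept at whatever precision the x87 control word selects, so the last bit of the result may differ from the SSE version.

```asm
; float __cdecl GetDistance(float x1, float y1, float x2, float y2)
; SSE math, no frame pointer, result bounced to st0 through the x1 arg slot.
GetDistance:
        movss   xmm0, dword [esp+12]    ; x2
        subss   xmm0, dword [esp+4]     ; dx = x2 - x1
        movss   xmm1, dword [esp+16]    ; y2
        subss   xmm1, dword [esp+8]     ; dy = y2 - y1
        mulss   xmm0, xmm0              ; dx*dx
        mulss   xmm1, xmm1              ; dy*dy
        addss   xmm0, xmm1              ; dx*dx + dy*dy
        sqrtss  xmm0, xmm0
        movss   dword [esp+4], xmm0     ; the callee owns its arg slots: reuse x1's
        fld     dword [esp+4]           ; return value in st0
        ret

; Same signature, x87 only (hypothetical name). Nothing has to be stored.
GetDistance_x87:
        fld     dword [esp+12]          ; st0 = x2
        fsub    dword [esp+4]           ; st0 = dx
        fmul    st0, st0                ; st0 = dx*dx
        fld     dword [esp+16]          ; st0 = y2, st1 = dx*dx
        fsub    dword [esp+8]           ; st0 = dy
        fmul    st0, st0                ; st0 = dy*dy
        faddp   st1, st0                ; st0 = dx*dx + dy*dy (one value left on the stack)
        fsqrt
        ret
```

Both versions end with a plain ret because cdecl leaves the stack cleanup to the caller.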

If you insist on using 32-bit code, then the default calling-conventions are old and bad, passing args on the stack and returning float/double in st0 instead of xmm0.

For Windows there are less bad 32-bit calling conventions, though. 32-bit vectorcall passes the first 6 FP (or SIMD vector) args in xmm registers, and returns in xmm0. And the first 2 integer args in regs like fastcall. (64-bit vectorcall only passes 4 args in XMM regs, differing from the standard Windows x64 convention only in handling types like __m128i and __m256.) See https://learn.microsoft.com/en-us/cpp/cpp/vectorcall?view=msvc-170 for more.

float __vectorcall foo(float a, float b, float c, float d, float e, float f, float g, int i){
    return a+b+c+d+e+f+g + i;
}

Compiles with x86 MSVC 19.10 (Godbolt). It's a callee-pops convention like fastcall; note the ret 4 since we have one stack arg. If you don't have any stack args, though, just a normal ret is still correct.

_g$ = 8                                       ; size = 4
float foo(float,float,float,float,float,float,float,int) PROC                                ; foo, COMDAT
        addss   xmm0, xmm1
        movd    xmm1, ecx
        cvtdq2ps xmm1, xmm1          ; avoids a false dependency vs. cvtsi2ss xmm1, ecx which is also 2 uops
        addss   xmm0, xmm2
        addss   xmm0, xmm3
        addss   xmm0, xmm4
        addss   xmm0, xmm5
        addss   xmm0, DWORD PTR _g$[esp-4]   ; 7th FP arg comes from the stack.
                                             ; with _g$ = 8, this is actually [esp+4]
        addss   xmm0, xmm1                   ; +i  converted earlier
        ret     4
float foo(float,float,float,float,float,float,float,int) ENDP                                ; foo

If your callers are also hand-written asm, then you don't have to follow a standard calling convention; you can pass/return args in convenient registers and document it with comments on a per-function basis.
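
As a rough sketch of that idea in FASM syntax, here is the distance function with a made-up, per-function convention. The register assignment, the clobber list and the names DistanceRegs and CallExample are all assumptions for illustration, not any standard ABI. movd between an integer register and an XMM register requires SSE2.

```asm
; Hypothetical private convention (documented here, not a standard one):
;   in:       xmm0 = x1, xmm1 = y1, xmm2 = x2, xmm3 = y2
;   out:      xmm0 = distance
;   clobbers: xmm2, xmm3
DistanceRegs:
        subss   xmm2, xmm0              ; dx = x2 - x1
        subss   xmm3, xmm1              ; dy = y2 - y1
        mulss   xmm2, xmm2              ; dx*dx
        mulss   xmm3, xmm3              ; dy*dy
        addss   xmm2, xmm3              ; dx*dx + dy*dy
        sqrtss  xmm0, xmm2              ; result in xmm0, no trip through memory
        ret

; Hand-written caller: distance from (0,0) to (3,4).
CallExample:
        xorps   xmm0, xmm0              ; x1 = 0.0
        xorps   xmm1, xmm1              ; y1 = 0.0
        mov     eax, 0x40400000         ; bit pattern of 3.0f
        movd    xmm2, eax               ; x2 = 3.0
        mov     eax, 0x40800000         ; bit pattern of 4.0f
        movd    xmm3, eax               ; y2 = 4.0
        call    DistanceRegs            ; xmm0 = 5.0
        ret
```

The catch is that a function like this can no longer be called directly from C, so it only makes sense for internal helpers.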

"ChatGPT's answers always created compile errors or made my function return 'NAN'. ChatGPT converted my simple function always to an executable main block which uses .data section and therefore global variables and I think it leads me into a complete wrong direction."

Unsurprising; ChatGPT is very bad at assembly language, and buggy code is normal.
It doesn't "understand" what it's doing in any language, but x86 asm was probably rarer in its training data and/or harder for large language models because the same register names and mnemonics get used in all programs. There are also many different flavours of assembly language (including several for x86), which probably doesn't help.

Oh wow! That clarifies a lot, thanks a lot! I know my next question is a bit traverse but I want you to know that I am not lazy or any kind of that. I would like to see and ask how you would write my function `GetDistance` in 32 bit assembly. I learn a lot when somebody picks up my function and rewrites it with his professional experience. But you actually convinced me to get rid of the stackframe for smaller code. Secondly do you have a nice resource for my belonging? (32 bit assembly on windows) — Zvend, Aug 06 '23 at 14:48
@Zvend: Your code looks efficient to me, assuming you want to use SSE for scalar FP math instead of x87 when you're writing a small non-inline function that has to return in st0. I'd just change the `[ebp+...]` parts to `[esp+4 + ...]` after removing the stuff with EBP, and change how you bounce XMM0 to st0. Re: learning resources, see https://stackoverflow.com/tags/x86/info especially https://agner.org/optimize/. — Peter Cordes, Aug 06 '23 at 14:52
@Zvend: With args on the stack, we could consider using `movsd` to load them in pairs, allowing `subps xmm0, xmm1` / `mulps xmm0, xmm0`, but then you'd need a shuffle to extract the 2nd float. Like SSE3 `movshdup xmm1, xmm0` / `addss xmm0, xmm1`. So less ILP (instruction-level parallelism), but saving a couple instructions, might be good for throughput. Or maybe not: if the caller stored the stack args with separate 32-bit stores, 64-bit loads would cause [store-forwarding stalls](https://stackoverflow.com/a/69631247/224132) which out-of-order exec might not fully hide, depending on other code. — Peter Cordes, Aug 06 '23 at 14:57
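
The packed variant described in that last comment could be sketched like this in FASM syntax. GetDistancePacked is a hypothetical name, the cdecl stack layout from the question is assumed, and `movshdup` requires SSE3. Whether it is actually faster depends on how the caller stored the args, as the comment notes.

```asm
; float __cdecl GetDistancePacked(float x1, float y1, float x2, float y2)
GetDistancePacked:
        movsd    xmm0, qword [esp+12]   ; xmm0 = { x2, y2, 0, 0 } (load zeroes the upper half)
        movsd    xmm1, qword [esp+4]    ; xmm1 = { x1, y1, 0, 0 }
        subps    xmm0, xmm1             ; xmm0 = { dx, dy, 0, 0 }
        mulps    xmm0, xmm0             ; xmm0 = { dx*dx, dy*dy, 0, 0 }
        movshdup xmm1, xmm0             ; xmm1 = { dy*dy, dy*dy, 0, 0 } (SSE3)
        addss    xmm0, xmm1             ; low element = dx*dx + dy*dy
        sqrtss   xmm0, xmm0
        movss    dword [esp+4], xmm0    ; reuse the x1 arg slot as scratch space
        fld      dword [esp+4]          ; return value in st0
        ret
```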
