A store/reload is necessary to transfer a value from an XMM register to st0. Even though the MMX registers alias the x87 registers, they only alias the low 64 bits (the significand), so there's no way to use MOVDQ2Q mm0, xmm0 to get an 80-bit FP bit-pattern into st0, even apart from the problem of switching back from MMX to x87 state without clearing the registers.
Related: Intel x86_64 assembly, How to move between x87 and SSE2? (calculating arctangent of double)
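One way to see the store/reload for yourself is to let a compiler generate it. The sketch below assumes x86-64 System V (e.g. GCC or Clang on Linux), where a float arg arrives in xmm0 and a long double is returned in st0. The function name is made up for illustration, and the asm in the comment is what those compilers typically emit, not a guarantee.

```c
/* Hypothetical helper: widen a float to long double.
 * Assumption: x86-64 System V, where `f` arrives in xmm0 and the
 * long double return value goes in st0. With optimization enabled,
 * GCC and Clang typically compile this to a store/reload such as
 *     movss  [rsp-4], xmm0
 *     fld    dword [rsp-4]
 *     ret
 * because no instruction copies XMM -> x87 directly. */
long double widen_to_x87(float f)
{
    return (long double)f;   /* value-preserving conversion */
}
```

On targets where long double is not the 80-bit x87 type, the function still returns the same value, but the generated code will look different.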
You don't need to waste instructions setting up EBP as a frame pointer, though, especially in simple functions like this where it's easy enough to keep track of offsets relative to ESP.
In a function with stack args, the callee (your function) "owns" them, so you can use [esp+4] as scratch space instead of reserving new space. This is why, when calling the same function twice with the same args, the caller has to store the args again. For example:
square: ; float square(float a); legacy cdecl convention
movss xmm0, [esp+4]
mulss xmm0, xmm0
movss [esp+4], xmm0 ; reuse the incoming arg as scratch space
fld dword [esp+4]
ret
In this case it would have been more efficient to use fld dword [esp+4] / fmul st0 / ret because we're using a calling convention that returns in st0.
If you insist on using 32-bit code, then the default calling conventions are old and bad, passing args on the stack and returning float/double in st0 instead of xmm0.
For Windows there are less bad 32-bit calling conventions, though. 32-bit vectorcall passes the first 6 FP (or SIMD vector) args in XMM registers and returns in xmm0. It also passes the first 2 integer args in registers, like fastcall. (64-bit vectorcall only passes 4 args in XMM regs, differing from the standard Windows x64 convention only in handling types like __m128i and __m256.) See https://learn.microsoft.com/en-us/cpp/cpp/vectorcall?view=msvc-170 for more.
float _vectorcall
foo(float a, float b, float c, float d, float e, float f, float g, int i){
return a+b+c+d+e+f+g + i;
}
It compiles as follows with x86 MSVC 19.10 (Godbolt). It's a callee-pops convention like fastcall; note the ret 4 since we have one stack arg. If you don't have any stack args, though, just a normal ret is still correct.
_g$ = 8 ; size = 4
float foo(float,float,float,float,float,float,float,int) PROC ; foo, COMDAT
addss xmm0, xmm1
movd xmm1, ecx
cvtdq2ps xmm1, xmm1 ; avoids a false dependency vs. cvtsi2ss xmm1, ecx which is also 2 uops
addss xmm0, xmm2
addss xmm0, xmm3
addss xmm0, xmm4
addss xmm0, xmm5
addss xmm0, DWORD PTR _g$[esp-4] ; 7th FP arg comes from the stack.
; with _g$ = 8, this is actually [esp+4]
addss xmm0, xmm1 ; +i converted earlier
ret 4
float foo(float,float,float,float,float,float,float,int) ENDP ; foo
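The register assignment and the ret 4 in that listing can be modelled with a few lines of C. This is a simplified sketch, not MSVC's actual algorithm: it assumes only scalar float, double and 32-bit int args (no vector types, structs or HVAs), and every name in it is made up for illustration.

```c
#include <stddef.h>

/* Simplified model of 32-bit __vectorcall argument passing, following the
 * description above: first 6 FP args in xmm0..xmm5, first 2 integer args in
 * ecx/edx, everything else on the stack.
 * Assumption: only scalar float, double and 32-bit int args. */
enum arg_kind { ARG_INT32, ARG_FLOAT, ARG_DOUBLE };

/* Returns the number of stack-arg bytes the callee pops, i.e. N in `ret N`. */
unsigned vectorcall32_stack_bytes(const enum arg_kind *args, size_t n)
{
    unsigned xmm_used = 0, int_used = 0, stack_bytes = 0;
    for (size_t i = 0; i < n; i++) {
        if (args[i] == ARG_INT32) {
            if (int_used < 2)
                int_used++;              /* goes in ecx or edx */
            else
                stack_bytes += 4;        /* 4-byte stack slot */
        } else {
            if (xmm_used < 6)
                xmm_used++;              /* goes in xmm0..xmm5 */
            else
                stack_bytes += (args[i] == ARG_DOUBLE) ? 8 : 4;
        }
    }
    return stack_bytes;
}

/* Signature of foo(float a..g, int i) from the example above. */
static const enum arg_kind foo_args[] = {
    ARG_FLOAT, ARG_FLOAT, ARG_FLOAT, ARG_FLOAT,
    ARG_FLOAT, ARG_FLOAT, ARG_FLOAT, ARG_INT32
};

unsigned foo_ret_imm(void)
{
    return vectorcall32_stack_bytes(foo_args, sizeof foo_args / sizeof foo_args[0]);
}
```

For the foo signature, only the 7th float lands on the stack, so foo_ret_imm() returns 4, matching the ret 4 in the compiler output. A signature with no stack args gives 0, which corresponds to a plain ret.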
If your callers are also hand-written asm, then you don't have to follow a standard calling convention; you can pass/return args in convenient registers and document it with comments on a per-function basis.
ChatGPT's answers always created compile errors or made my function return 'NAN'. ChatGPT converted my simple function always to an executable main block which uses .data section and therefore global variables and I think it leads me into a complete wrong direction.
Unsurprising: ChatGPT is very bad at assembly language, and buggy code is normal.
It doesn't "understand" what it's doing in any language, but x86 asm was probably rarer in its training data and/or harder for large language models because the same register names and mnemonics get used in all programs. The many different flavours of assembly language (including multiple syntaxes for x86) probably don't help either.