I was reading about the math coprocessor (Paul Carter's PC Assembly Book) and its instructions for floating-point calculations (i386 assembly). Then I ran into the following code, which is supposed to return the larger of two given double values (C calling convention):
1 %define d1 ebp+8
2 %define d2 ebp+16
3 global dmax
4
5 segment .text
6 dmax:
7 enter 0,0
8
9 fld qword [d2]
10 fld qword [d1] ;Now ST0 = d1 and ST1 = d2
11 fcomip st1 ;Compares ST0 with ST1 and pops ST0 out
12 jna short d2_bigger ;If not above (d1 <= d2)
13 fcomp st0 ;Get rid of ST0, which is actually d2 now (line 11)
14 fld qword [d1]
15 jmp short exit
16 d2_bigger:
17 exit:
18 leave
19 ret
There were two things I was thinking about changing in this code. First, I'd probably use FCOMI instead of FCOMIP for the comparison (line 11) to avoid one unnecessary coprocessor register pop. Doing this, if ST0 = ST1 there would be no pop at all (since the value is already at the top of the stack). The only reason I can see for not doing it is that it would leave a non-empty coprocessor register stack. However, I think the only relevant value for C is ST0, which holds the return value of a function returning double. If another function pushed more than 8 float/double values onto the coprocessor stack, wouldn't the values stored in the lowest entries of the stack (ST7) just be discarded? So is it really an issue to leave a function without clearing the coprocessor stack? => (READ EDIT)
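To make the FCOMI idea concrete, here is a minimal sketch of how the function might look with a non-popping compare. It assumes the caller expects the x87 stack to hold only the return value in ST0 on return, as the i386 C calling convention requires for a function returning double, so each path still does exactly one pop. The labels d1_bigger and done are names made up for this sketch, and FSTP ST1 is a swapped-in way of dropping d2 that is not part of the book's code.

```nasm
%define d1 ebp+8
%define d2 ebp+16
global dmax

segment .text
dmax:
        enter   0,0

        fld     qword [d2]
        fld     qword [d1]      ; ST0 = d1, ST1 = d2
        fcomi   st1             ; compare ST0 with ST1, nothing is popped
        ja      short d1_bigger ; taken only if d1 > d2
        fstp    st0             ; pop d1, so ST0 = d2
        jmp     short done
d1_bigger:
        fstp    st1             ; copy d1 over d2, then pop, so ST0 = d1
done:
        leave
        ret
```

Compared with the original, this avoids reloading d1 from memory, but it does not remove the pop: one of the two loaded values always has to leave the stack before returning.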
The second thing I was thinking of changing: I'd probably not use the FCOMP instruction on line 13. I understand it is there to pop ST0 off the stack so that ST1 reaches the top. However, I think it's a bit of overhead to perform a whole comparison and set the coprocessor flags just to pop a value. I looked for an instruction that only pops ST0 and apparently there is none. I thought it would be faster, though, to use FADDP ST0, ST0 (adds ST0 to ST0 and pops ST0) or FSTP ST0 (stores the value of ST0 to ST0 and pops ST0). In my head they just look like less work for the coprocessor.
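For reference, this is roughly what the FSTP ST0 variant described above would look like. It is only a sketch of the one-instruction substitution on line 13; everything else follows the listing from the book, with the line numbers dropped and the empty d2_bigger/exit pair merged into a single label.

```nasm
%define d1 ebp+8
%define d2 ebp+16
global dmax

segment .text
dmax:
        enter   0,0

        fld     qword [d2]
        fld     qword [d1]      ; ST0 = d1, ST1 = d2
        fcomip  st1             ; compare d1 with d2 and pop d1, so ST0 = d2
        jna     short d2_bigger ; d1 <= d2: d2 is already in ST0
        fstp    st0             ; pop d2 without doing a comparison
        fld     qword [d1]      ; ST0 = d1
d2_bigger:
        leave
        ret
```

FSTP ST0 stores the top of the stack onto itself and then pops, so the net effect is just the pop, and unlike FCOMP it leaves the FPU condition codes out of it.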
I tried to test the speed of the 3 options (the one in the code above, FSTP ST0 and FADDP ST0, ST0) and after a few quick tests they all ran at very similar speeds, too close to draw a conclusion from the values. Apparently FADDP ST0, ST0 was a bit faster, followed by FSTP ST0 and finally FCOMP ST0. Is there a recommendation on which one to use? Or am I worrying too much about something that will have a negligible effect on the overall speed?
I'm asking because, since assembly is about doing things the fastest way possible, maybe choosing one of these approaches could have a benefit.
EDIT:
I was reading the Intel 64 and IA-32 Instruction Set Reference, and apparently the coprocessor raises an exception if the register stack overflows or underflows (exception #IS). So using the stack and not emptying it (in this case, leaving only ST0 so that C will pop the return value from it) is apparently not an option.