I have this old code that transforms spherical coordinates to 3D Cartesian coordinates:

TDVector3D Cartesian3D_asm(const double &Theta, const double &Phi)
{
  TDVector3D V;
  __asm__
  {
    mov    eax,[ebp+0x0C]
    mov    edx,[ebp+0x10]
    fld     qword ptr [eax]  // ST0=T     Theta
    fsincos                  // ST1=sin(T)  ST0=cos(T)
    fxch    ST(1)            // ST1=cos(T)  ST0=sin(T)
    fld     qword ptr [edx]  // ST2=cos(T)  ST1=sin(T)  ST0=P   Phi
    fsincos                  // ST3=cos(T)  ST2=sin(T)  ST1=sin(P) ST0=cos(P)
    fmul    ST,ST(2)         // ST3=cos(T)  ST2=sin(T)  ST1=sin(P) ST0=cos(P)*sin(T)
    fstp    qword ptr V.X    // ST2=cos(T)  ST1=sin(T)  ST0=sin(P)

    fmulp   ST(1),ST         // ST1=cos(T)  ST0=sin(P)*sin(T)
    fstp    qword ptr V.Y    // ST0=cos(T)

    fstp    qword ptr V.Z    // x87 stack is now empty
    fwait
  }
  return V;
}

with this TDVector3D struct:

typedef struct TDVector3D {
        double X, Y, Z;
        TDVector3D(double x, double y, double z): X(x), Y(y), Z(z) { }
} TDVector3D;

The non-assembly version is:

TDVector3D Cartesian3D(const double &Theta, const double &Phi)
{
  double X, Y, Z;
  X = Y = sin(Theta);
  X *= cos(Phi);
  Y *= sin(Phi);
  Z = cos(Theta);
  return TDVector3D(X, Y, Z);
}

I found this sample for SinCos:

void SinCos(double Theta, double *sinT, double *cosT)
{
    __asm__ ("fsincos" : "=t" (*cosT), "=u" (*sinT) : "0" (Theta));
}

I am trying to convert my old code to this style, but I am totally lost with "u", "0" and "t" (which is which with respect to ST0, ST1, ...).

Peter Cordes

1 Answer

https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html - "=t" means that output picks the Top of Stack register, st(0). "0" picks the same location as operand 0, a "matching constraint", so also st(0), which makes sense because that's the input for fsincos. "=u" is unsurprisingly st(1), the other output.
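To make the operand mapping concrete, here is a minimal sketch of the same wrapper with each constraint annotated. The name `SinCosAnnotated` and the non-x86 fallback branch are assumptions added for illustration; the asm statement itself is the one from the question.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical name: the question's SinCos with every operand labelled.
static inline void SinCosAnnotated(double theta, double *sinT, double *cosT)
{
    assert(sinT != nullptr && cosT != nullptr);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    double s, c;
    // fsincos replaces st(0) with sin(st(0)), then pushes cos,
    // so afterwards st(0) = cos and st(1) = sin.
    __asm__ ("fsincos"
             : "=t" (c),     // operand 0: output read from st(0), the cosine
               "=u" (s)      // operand 1: output read from st(1), the sine
             : "0" (theta)); // input: same register as operand 0, i.e. st(0)
    *sinT = s;
    *cosT = c;
#else
    // Assumed fallback so the sketch still builds on non-x86 targets.
    *sinT = std::sin(theta);
    *cosT = std::cos(theta);
#endif
}
```

The compiler emits whatever loads and stores are needed to get `theta` into st(0) and the results out again, so the asm template only has to contain the one instruction.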

IDK why you want to use inline asm at all, though, instead of just letting the compiler optimize sincos() (a GNU extension) from <math.h> / <cmath>; a math library function call is fine, and maybe faster than the x87 instruction. Might even auto-vectorize if Cartesian3D is called in a loop, producing 2 or 4 results with about the same amount of work as one.
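As a rough sketch of the library route, the loop below converts arrays of angles using `sincos()`. The names `SinCosLibm` and `Cartesian3D_batch`, the separate x/y/z output arrays, and the fallback to plain `sin`/`cos` when glibc's extension is unavailable are all assumptions for illustration. Whether the loop actually vectorizes depends on the compiler, the flags, and the math library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical wrapper: glibc's sincos() where available, plain sin/cos otherwise.
static inline void SinCosLibm(double a, double *s, double *c)
{
#if defined(__GLIBC__) && defined(_GNU_SOURCE)
    ::sincos(a, s, c);
#else
    *s = std::sin(a);
    *c = std::cos(a);
#endif
}

// Converts n (theta, phi) pairs; the angles are passed by value per element.
void Cartesian3D_batch(const double *theta, const double *phi,
                       double *x, double *y, double *z, std::size_t n)
{
    assert(n == 0 || (theta && phi && x && y && z));
    for (std::size_t i = 0; i < n; ++i) {
        double sT, cT, sP, cP;
        SinCosLibm(theta[i], &sT, &cT);
        SinCosLibm(phi[i], &sP, &cP);
        x[i] = sT * cP;
        y[i] = sT * sP;
        z[i] = cT;
    }
}
```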

Also, why take a double by const reference? It's small enough to pass by value. BTW, the main reason to use x87 in modern code would be for 80-bit extended precision. If you need that, still just use sincosl. GCC might inline it to a fsincos instruction, or might call a library function which might still be faster; https://agner.org/optimize times fsincos at 60-120 uops, with 60-140 cycle latency (strangely no throughput reported.)
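If extended precision is the goal, a `long double` variant might look like the sketch below. `TLVector3D` and `Cartesian3D_ld` are hypothetical names, and the `sinl`-style fallback is an assumption for platforms without the GNU `sincosl` extension. Note that `long double` is only the 80-bit x87 format on some targets (such as x86 Linux with GCC).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical long double counterpart of the question's TDVector3D.
struct TLVector3D {
    long double X, Y, Z;
};

TLVector3D Cartesian3D_ld(long double Theta, long double Phi)
{
    assert(std::isfinite(Theta) && std::isfinite(Phi));
    long double sT, cT, sP, cP;
#if defined(__GLIBC__) && defined(_GNU_SOURCE)
    ::sincosl(Theta, &sT, &cT);   // GNU extension: sine and cosine in one call
    ::sincosl(Phi, &sP, &cP);
#else
    // Assumed fallback for C libraries without sincosl.
    sT = std::sin(Theta);  cT = std::cos(Theta);
    sP = std::sin(Phi);    cP = std::cos(Phi);
#endif
    return TLVector3D{sT * cP, sT * sP, cT};
}
```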

Also, if you do insist on using asm to force it to run an x87 fsincos, you don't need to convert it to one big asm() statement; you can just call that working SinCos() twice, for two separate inputs, and the compiler will take care of loads / stores and fxch. Well, I thought you wouldn't need to, but GCC and clang do a rather poor job in 64-bit mode, wanting to bounce data back to XMM registers for the multiplies. See https://godbolt.org/z/1j9df4cro ; this happens whether you pass by value (forcing it to spill XMM registers to memory first if it doesn't inline) or by reference.

Even in 32-bit mode it was auto-vectorizing the multiply I think, wanting to use mulpd. https://godbolt.org/z/sTd43zono shows GCC -m32 -O3 -mfpmath=387 -fno-tree-vectorize making efficient asm using your wrapper function, with about the same number of instructions as your inline asm{} block.

static inline void SinCos_x87(double Theta, double *sinT, double *cosT)
{
    __asm__ ("fsincos" : "=t" (*cosT), "=u" (*sinT) : "0" (Theta));
}

#if 1
 #define SINCOS SinCos_x87
#else
 #include <math.h>
 #define SINCOS sincos
#endif

TDVector3D Cartesian3D_sincos(const double Theta, const double Phi)
{
    double sinTheta, cosTheta, sinPhi, cosPhi;
    SINCOS(Theta, &sinTheta, &cosTheta);
    SINCOS(Phi, &sinPhi, &cosPhi);
    double X = sinTheta * cosPhi;
    double Y = sinTheta * sinPhi;
    double Z = cosTheta;
    return TDVector3D(X, Y, Z);
}

Unlike MSVC's inefficient style of asm{} block, GNU C inline asm can efficiently wrap a single instruction and tell the compiler which registers to place the inputs in and where to find the outputs. There is no overhead, so as long as the compiler can generate efficient code between the inlined asm("":::) statements, there's no benefit to having one more complicated asm statement.

That is the case for 32-bit mode without auto-vectorization, but not for 64-bit mode (unless you also use -mfpmath=387 for that compilation unit! Godbolt)

Peter Cordes
  • Hi, many thanks for your answer. I learned a lot of things. So finally, I don't need to write complicated asm code, I can just trust the compiler to make the best code and add -mfpmath=387 to my command line. Thanks again, Lionel. – Lionel_stack Mar 18 '23 at 14:53
  • @Lionel_stack: Using `-mfpmath=387` in general will make your *other* code slower, especially if you're compiling for 64-bit mode. So you don't want to use it for all your files. You should just call the libm function `sincos()`, which has the same function signature as your `SinCos_x87`, especially if you aren't compiling for 32-bit mode or using 80-bit `long double`. (You might want to benchmark the asm version against one that calls `sincos`). – Peter Cordes Mar 19 '23 at 03:58
  • Indeed, that's exactly what I saw when I added `-mfpmath=387`. So I just use what you proposed :) But I ran another test with just GNU's sincos, and it seems to be faster (though surprisingly, I don't see a call to fsincos in the disassembly). – Lionel_stack Mar 19 '23 at 13:50
  • @Lionel_stack: Right, that's what I said in my answer, `fsincos` isn't a fast instruction. It's more efficient to implement an algorithm for approximating `sin`/`cos` to about the precision of a `double` using your own basic operations, not the sequence of micro-ops you get from the microcoded `fsincos` instruction. That's what math libraries do. And a vector math library can do it for two or four doubles in parallel if they avoid branching, using packed-double instructions like `mulpd` instead of just scalar, so you can hopefully get two separate `sincos` results for about the price of one. – Peter Cordes Mar 19 '23 at 19:41
  • Thanks Peter for your patience ;). I have learned a lot of things and I will take a look at Agner's vector class library. There is a header file to compute sincos. – Lionel_stack Mar 21 '23 at 09:13
  • @Lionel_stack: Yeah, for sure, if GCC / clang aren't auto-vectorizing for you with glibc's libmvec or anything else, then explicit vectorization could be very helpful. https://sourceware.org/glibc/wiki/libmvec . Apparently GCC only auto-vectorizes math functions with `-ffast-math`; maybe the vector versions aren't as precise or don't handle NaN and/or subnormals or something. – Peter Cordes Mar 21 '23 at 10:46
  • Well, it becomes complicated to find the best compromise! In the end, I think that -Ofast includes everything (like -ffast-math) and, depending on the processor, we can add -mavx or -mavx2. [link](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options) – Lionel_stack Mar 23 '23 at 13:57
  • @Lionel_stack: Use `-Ofast -march=native` if you want to make binaries tuned for your CPU, and using any/all extensions it supports. – Peter Cordes Mar 23 '23 at 14:12
  • Super, thanks again Peter for this information ;) – Lionel_stack Mar 24 '23 at 17:25