cos() from math.h runs faster than the x86 asm fcos instruction

The following code compares the x86 fcos instruction with cos() from math.h.

In this code, 1000000 executions of the asm fcos take 150 ms, while 1000000 calls to cos() take only 80 ms.

How is fcos implemented in x86? Why is fcos so much slower than cos()?

My environment is an Intel i7-6820HQ + Windows 10 + Visual Studio 2017.

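#include <stdio.h>   // printf
#include <time.h>    // clock
#include <math.h>    // cos

int main()
{
  int i;
  const int i_max = 1000000;

  float start_value = 8.333333f;
  float* pstart_value = &start_value;
  clock_t a, b;

  // Time 1000000 dependent fcos instructions (MSVC x86 inline asm).
  a = clock();

  __asm {
    mov edx, pstart_value   ; edx = address of start_value
    fld dword ptr [edx]     ; push start_value onto the x87 stack
  }

  for (i = 0; i < i_max; i++) {
    __asm {
        fcos                ; st(0) = cos(st(0))
    }
  }

  __asm {
    fstp st(0)              ; pop the result so the x87 stack is empty again
  }

  b = clock();
  printf("asm time = %ld\n", (long)(b - a));

  // Time 1000000 dependent calls to the library cos().
  a = clock();
  for (i = 0; i < i_max; i++) {
    start_value = (float)cos(start_value);
  }

  b = clock();
  printf("math time = %ld\n", (long)(b - a));
  return 0;
}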

As I understand it, a single asm instruction is usually faster than a function call. Why is fcos so slow in this case?


Update: I ran the same code on another laptop with an i7-6700HQ. On that laptop, 1000000 executions of fcos take only 51 ms. Why is there such a big difference between the two CPUs?

Alexey Frunze
TBD
  • Look at the assembly to see what happens with `cos()`. – Shawn Apr 13 '19 at 13:16
  • At least tag this with what implementation you're using since, aside from the wacky style of inline asm being implementation-specific, performance properties of the math library and dead code elimination will be too. – R.. GitHub STOP HELPING ICE Apr 13 '19 at 13:52
  • It probably uses the Streaming SIMD Extensions instead of the old, not further optimised x87 instructions. – sivizius Apr 13 '19 at 18:43
  • Which i7 model? That spans a range from Nehalem in ~2008 (`fcos` decodes to ~100 uops, and takes 40 to 100 clock cycles) to Coffee Lake in 2019 (fcos decodes to 53 to 105 uops and takes 50 to 130 clock cycles) https://agner.org/optimize/. Or even Cannon Lake... And with a significant range in performance for the SSE2 instructions that math library `cos()` probably uses. – Peter Cordes Apr 14 '19 at 01:33
  • @Peter Cordes My CPU is an i7-6820HQ. Today I ran the same code on a laptop with an i7-6700HQ, where the asm fcos loop took 51 ms. Why is there such a big difference between the two CPUs? – TBD Apr 14 '19 at 10:31
  • Partly Intel did not put a lot of silicon into making `fcos` fast (presumably they did not think its usage justified that) and it is largely serial in that another `fcos` cannot make much progress while a prior one is executing. A software `cos` may consist of multiple parts—initial code that classifies the operand, then code that performs argument reduction, then code that evaluates a polynomial approximation, then code that recombines parts. Some of those parts can operate in parallel with others of the same call (the polynomial itself may be constructed from two or more polynomials)… – Eric Postpischil Apr 14 '19 at 12:48
  • … and some parts can operate in parallel with parts from a prior or subsequent call to `cos`, so you may get more overlap with software `cos` than with hardware `fcos`. Also, when writing Apple’s `cosf`, I took advantage of SSE to get some more parallelism even within a single `cosf` call. (I do not recall doing that in the `cos` call, but I would have to go back to the source to check.) Being a software routine, it was free to spread itself out over any registers the ABI did not require be preserved. – Eric Postpischil Apr 14 '19 at 12:50
  • What are those numbers in core clock cycles? (Not RDTSC reference cycles unless you disable turbo; either do that and add some CPU warmup code, or measure with HW performance counters). Could one of your CPUs be ramping up to max turbo a lot faster? Do both cases get the same numeric answer? `fcos` performance is data-dependent. It might matter if old software was setting the x87 precision to only a 52-bit or 23-bit significand (speeding up `fdiv` and `fsqrt` uops, in case the microcode for `fcos` uses any of those.) – Peter Cordes Apr 14 '19 at 15:55
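
The stages Eric Postpischil lists in the comments above (classify the operand, reduce the argument, evaluate polynomials, recombine) can be sketched roughly as follows. This is only an illustration and not the code of any real math library: the name `cos_sketch` is hypothetical, the coefficients are plain Taylor terms, and the single-step reduction loses accuracy for large arguments. The point is that the two polynomial chains do not depend on each other, so an out-of-order CPU can overlap them.

```cpp
#include <cmath>

// Hypothetical illustration of the stages of a software cosine.
// Not production quality: accuracy degrades for large |x|.
double cos_sketch(double x) {
    const double half_pi = 1.57079632679489661923;

    // 1. Classify the operand: infinities and NaN have no cosine.
    if (!std::isfinite(x)) return std::nan("");

    // 2. Argument reduction: x = k * (pi/2) + r with |r| <= pi/4.
    double k = std::round(x / half_pi);
    double r = x - k * half_pi;
    double r2 = r * r;

    // 3. Two independent polynomial evaluations (Horner form, Taylor
    //    coefficients). Neither chain needs the other's result.
    double c = 1.0 + r2 * (-1.0 / 2 + r2 * (1.0 / 24 + r2 * (-1.0 / 720
             + r2 * (1.0 / 40320 + r2 * (-1.0 / 3628800)))));
    double s = r * (1.0 + r2 * (-1.0 / 6 + r2 * (1.0 / 120
             + r2 * (-1.0 / 5040 + r2 * (1.0 / 362880)))));

    // 4. Recombine according to the quadrant k mod 4.
    int q = static_cast<int>(std::fmod(k, 4.0));
    if (q < 0) q += 4;
    switch (q) {
        case 0:  return c;   // cos(r)
        case 1:  return -s;  // cos(r + pi/2)  = -sin(r)
        case 2:  return -c;  // cos(r + pi)    = -cos(r)
        default: return s;   // cos(r + 3pi/2) =  sin(r)
    }
}
```

For small arguments such as the question's 8.333333, `cos_sketch(x)` should agree with `std::cos(x)` to several decimal places. Real libraries typically use minimax coefficients and a much more careful reduction step.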

1 Answer

I bet the answer is simple: you do not use the result of cos, so the call is optimized out, as in this example:

https://godbolt.org/z/iw-nft

Make the variables volatile to force the cos call:

https://godbolt.org/z/9_dpMs
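
For readers who cannot open the links, here is a minimal sketch of the volatile idea. It is not the code behind the links above. The helper name `time_cos_ms` is hypothetical, and it uses `std::chrono::steady_clock` rather than the question's `clock()` so that it does not rely on a platform-specific tick length.

```cpp
#include <chrono>
#include <cmath>

// Hypothetical helper: runs `iterations` dependent cos() calls and returns
// the elapsed wall-clock time in milliseconds. Because `sink` is volatile,
// every load and store must actually happen, so the compiler cannot drop
// the loop as dead code.
double time_cos_ms(int iterations, double *result_out) {
    volatile double sink = 8.333333;  // same starting value as the question
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = std::cos(sink);        // volatile read, cos call, volatile write
    }
    auto stop = std::chrono::steady_clock::now();
    *result_out = sink;               // hand the final value back to the caller
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `time_cos_ms(1000000, &r)` and printing both the time and `r` gives a measurement of work that really ran. The volatile accesses add a little memory traffic to each iteration, so the figure is a slight overestimate of the cost of `cos()` itself.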

Another guess: maybe your cos implementation uses lookup tables, in which case it could be faster than the hardware implementation.
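
To show what a table-based cosine could look like, here is a rough sketch using a precomputed table with linear interpolation. I do not know whether the Visual Studio math library does anything like this; the class name `CosTable` and the table size are assumptions made for illustration only.

```cpp
#include <cmath>
#include <vector>

const double kTwoPi = 6.283185307179586;

// Hypothetical table-based cosine: samples one full period up front,
// then answers queries by interpolating between neighbouring samples.
class CosTable {
public:
    explicit CosTable(int size) : size_(size), table_(size + 1) {
        for (int i = 0; i <= size; ++i) {
            table_[i] = std::cos(kTwoPi * i / size);
        }
    }

    double operator()(double x) const {
        double t = std::fmod(x, kTwoPi);   // fold x into (-2pi, 2pi)
        if (t < 0) t += kTwoPi;            // then into [0, 2pi]
        double pos = t / kTwoPi * size_;   // fractional table index
        int i = static_cast<int>(pos);
        if (i >= size_) i = size_ - 1;     // guard the t == 2pi edge case
        double frac = pos - i;
        return table_[i] + frac * (table_[i + 1] - table_[i]);
    }

private:
    int size_;
    std::vector<double> table_;
};
```

With 1024 entries the interpolation error stays below about 1e-5, far coarser than a real `cos()`. A lookup like this trades accuracy for a handful of arithmetic operations and one memory access per call.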

chux - Reinstate Monica
0___________
  • Thank you, but I have checked the disassembly and the cos call is not optimized out. Also, does the standard C math library in Visual Studio use lookup tables? – TBD Apr 14 '19 at 10:48