8

What is the reason for the catastrophic performance of pow() for NaN values? As far as I can work out, NaNs should not have an impact on performance if the floating-point math is done with SSE instead of the x87 FPU.

This seems to be true for elementary operations, but not for pow(). I compared multiplication and division of a double to squaring and then taking the square root. If I compile the piece of code below with g++ -lrt, I get the following result:

multTime(3.14159): 20.1328ms
multTime(nan): 244.173ms
powTime(3.14159): 92.0235ms
powTime(nan): 1322.33ms

As expected, calculations involving NaN take considerably longer. Compiling with g++ -lrt -msse2 -mfpmath=sse, however, results in the following times:

multTime(3.14159): 22.0213ms
multTime(nan): 13.066ms
powTime(3.14159): 97.7823ms
powTime(nan): 1211.27ms

The multiplication / division of a NaN is now much faster (actually faster than with a real number), but squaring and then taking the square root still take a very long time.

Test code (compiled with gcc 4.1.2 on 32-bit OpenSuSE 10.2 in VMWare; the CPU is a Core i7-2620M):

#include <iostream>
#include <sys/time.h>
#include <time.h>    // clock_gettime, struct timespec (link with -lrt)
#include <cmath>     // pow, NAN

void multTime( double d )
{
   struct timespec startTime, endTime;
   double durationNanoseconds;

   clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startTime);

   // repeatedly double and halve d (a NaN stays NaN throughout)
   for(int i=0; i<1000000; i++)
   {
      d = 2*d;
      d = 0.5*d;
   }

   clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endTime);
   durationNanoseconds = 1e9*(endTime.tv_sec - startTime.tv_sec) + (endTime.tv_nsec - startTime.tv_nsec);
   std::cout << "multTime(" << d << "): " << durationNanoseconds/1e6 << "ms" << std::endl;
}

void powTime( double d )
{
   struct timespec startTime, endTime;
   double durationNanoseconds;

   clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startTime);

   // repeatedly square d and take the square root (a NaN stays NaN throughout)
   for(int i=0; i<1000000; i++)
   {
      d = pow(d,2);
      d = pow(d,0.5);
   }

   clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endTime);
   durationNanoseconds = 1e9*(endTime.tv_sec - startTime.tv_sec) + (endTime.tv_nsec - startTime.tv_nsec);
   std::cout << "powTime(" << d << "): " << durationNanoseconds/1e6 << "ms" << std::endl;
}

int main()
{
   multTime(3.14159);
   multTime(NAN);

   powTime(3.14159);
   powTime(NAN);
}

Edit:

Unfortunately, my knowledge of this topic is extremely limited, but my guess is that glibc's pow() never uses SSE on a 32-bit system, but rather some assembly in sysdeps/i386/fpu/e_pow.S. There is a function __ieee754_pow_sse2 in more recent glibc versions, but it lives in sysdeps/x86_64/fpu/multiarch/e_pow.c and therefore probably only works on x64. However, all of this might be irrelevant here, because pow() is also a gcc built-in function. For an easy fix, see Z boson's answer.
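
To rule out the gcc built-in as a factor, one can force the call to go through the library, either by compiling with -fno-builtin-pow or by calling through a volatile function pointer. A minimal sketch of the latter (libm_pow is just a name I made up; the timing would be done exactly as in the code above):

#include <iostream>
#include <cmath>

// Calling through a volatile function pointer should keep gcc from applying
// its built-in knowledge of pow(), so the loop exercises the library version.
double (*volatile libm_pow)(double, double) = pow;

int main()
{
   double d = NAN;
   for(int i=0; i<1000000; i++)
   {
      d = libm_pow(d, 2);
      d = libm_pow(d, 0.5);
   }
   std::cout << d << std::endl;   // also keeps the loop from being optimized away
}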

dasdingonesin
  • Can you figure out whether it's a hardware or a library problem? – Kerrek SB Jul 24 '14 at 09:06
  • I have (modern laptop, linux, gcc 4.9): 17, 6, 70, 13 ms. Do you have access to a more recent compiler / library? I wouldn't be surprised if your VM has something to do with this. – quantdev Jul 24 '14 at 09:11
  • @quantdev, I agree, I can't reproduce his results. I suspect it's his VM. Maybe it's throwing an exception for NAN like it does for denormalized numbers. – Z boson Jul 24 '14 at 09:17
  • Unfortunately I'm limited to this completely outdated version of gcc and glibc (2.5), forced to run it in a VM, and have no access to a compiler outside the VM. The only thing that came to mind was trying it in MATLAB, which confirms your results and probably your suspicions about the VM being responsible... – dasdingonesin Jul 24 '14 at 09:22
  • hold on, GCC 4.1 is ancient! It came out in 2007. It's 7 years old! – Z boson Jul 24 '14 at 09:24
  • And OpenSuSE 10.2 is ancient as well. It came out in 2008. Upgrade your system. Use a 64-bit OS as well. – Z boson Jul 24 '14 at 09:27
  • @quantdev, I can't reproduce his results with VirtualBox either. I do get the first part but compiling in 64-bit or using SSE fixes that as he sees but I can't reproduce the problem with `powTime(nan)`. – Z boson Jul 24 '14 at 09:28
  • As I said, there is nothing I can do about the outdated toolchain / environment. – dasdingonesin Jul 24 '14 at 09:29
  • Then find another math library you can use. – Z boson Jul 24 '14 at 09:30

4 Answers

8

"NaNs should not have an impact on performance if the floating-point math is done with SSE instead of the x87 FPU."

I'm not sure this follows from the resource you quote. In any case, pow is a C library function; it is not implemented as a single instruction, even on x87. So there are two separate issues here: how SSE handles NaN values, and how a pow function implementation handles NaN values.

If the pow function implementation uses a different code path for special values like +/-Inf or NaN, you might expect a NaN base or exponent to return quickly. On the other hand, the implementation might not handle these as separate cases, and simply rely on floating-point operations to propagate intermediate results as NaN values.
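
As a purely illustrative sketch (not glibc's actual code), such a special-value dispatch could look roughly like this, with NaN operands returning early rather than falling through to the generic path:

#include <cmath>

// Illustration only -- a real implementation must also respect corner cases
// such as pow(NaN, 0) == 1 and pow(1, NaN) == 1 required by C99.
double pow_sketch(double x, double y)
{
    if (x != x || y != y)
        return x + y;            // fast exit: a NaN operand propagates immediately
    if (y == 2.0)
        return x * x;            // one of many cheap special cases
    // ... further special cases: 0, 1, +/-Inf, integral exponents, ...
    return exp(y * log(x));      // generic path (valid for x > 0), typically the slowest
}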

Starting with 'Sandy Bridge', many of the performance penalties associated with denormals were reduced or eliminated. Not all though, as the author describes a penalty for mulps. Therefore, it would be reasonable to expect that not all arithmetic operations involving NaNs are 'fast'. Some architectures might even revert to microcode to handle NaNs in different contexts.

Brett Hale
  • His CPU is Sandy Bridge. It's his ancient math library that's the problem. – Z boson Jul 24 '14 at 09:48
  • @Zboson - Have you looked at the implementation of the `pow` function in the C library for the OP's platform? How do you know a Haswell CPU might not handle the same code with *no* penalties? – Brett Hale Jul 24 '14 at 09:58
  • I don't know 100% but in this case I trust [Fat Tony more than Dr. John](https://en.wikipedia.org/wiki/Ludic_fallacy) – Z boson Jul 24 '14 at 10:02
  • @Zboson - I think the real [Dr. John](https://www.youtube.com/watch?v=HT4RainY-lY) is more instructive. – Brett Hale Jul 24 '14 at 10:23
  • Haha that's pretty funny:-) – Z boson Jul 24 '14 at 10:48
3

Your math library is too old. Either find another math library that handles NAN in pow better, or implement a fix like this:

#include <cmath>   // pow

// x != x is true only when x is NaN, so NaN arguments return immediately
// instead of going through the slow library path.
inline double pow_fix(double x, double y)
{
    if(x!=x) return x;
    if(y!=y) return y;
    return pow(x,y);
}

Compile with g++ -O3 -msse2 -mfpmath=sse foo.cpp.
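
For example, in the question's powTime() loop you would simply call the wrapper instead of pow(); with the NaN short-circuit, powTime(nan) should no longer be an outlier:

   // inside powTime(), replacing the two pow() calls:
   for(int i=0; i<1000000; i++)
   {
      d = pow_fix(d,2);
      d = pow_fix(d,0.5);
   }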

Z boson
2

If you want to square a value or take its square root, use d*d or sqrt(d). pow(d,2) and pow(d,0.5) will be slower and possibly less accurate, unless your compiler optimizes them based on the constant second arguments 2 and 0.5. Note that such an optimization is not always possible for pow(d,0.5): it returns +0.0 if d is a negative zero, while sqrt(d) returns -0.0.

For those doing timings, please make sure that you test the same thing.
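
A small illustration of the negative-zero difference (assuming an IEEE 754 / Annex F conforming libm):

#include <cmath>
#include <iostream>

int main()
{
    double z = -0.0;
    // pow must return +0.0 for a negative-zero base and exponent 0.5,
    // while sqrt preserves the sign of zero:
    std::cout << pow(z, 0.5) << std::endl;   // prints 0   (+0.0)
    std::cout << sqrt(z)     << std::endl;   // prints -0  (-0.0)
}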

vinc17
  • `d*d` and `sqrt()` are indeed the right tools for the job of squaring and taking the square root, but I was interested in the performance of `pow()`; the exponents 2 and 0.5 were just chosen arbitrarily. I could also have used 2.5 and 0.4 or anything else. – dasdingonesin Jul 24 '14 at 12:52
  • @dasdingonesin If you want to do timings on `pow()`, you shouldn't use the constant 2 as the second argument. For instance, GCC 4.9 optimizes it to `mulsd` (a multiplication) as soon as `-O` is used (the first level of optimization). However, it seems that you didn't use optimizations, which is itself a bad idea when doing timings, since real code is almost always compiled with optimizations. The builtin can be disabled in various ways, such as with `-fno-builtin`. – vinc17 Jul 24 '14 at 13:16
  • @dasdingonesin A good implementation of pow() is full of special cases for all sorts of different values so you shouldn't expect '2' and '0.5' to be representative of the performance of pow in general. – Bruce Dawson Feb 23 '15 at 02:23
2

With a complex function like pow() there are lots of ways that NaN could trigger slowness. It could be that the operations on NaNs are slow, or it could be that the pow() implementation checks for all sorts of special values that it can handle efficiently, and the NaN values fail all of those tests, leading to a more expensive path being taken. You'd have to step through the code to find out for sure.
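
For instance, any equality-style fast-path check inside pow() necessarily fails for NaN, because every ordered comparison with NaN is false; a tiny illustration:

#include <cmath>
#include <iostream>

int main()
{
    double y = NAN;
    // All ordered comparisons with NaN evaluate to false, so tests like
    // "y == 2.0" or "y == 0.5" inside pow() can never select a fast path:
    std::cout << (y == 2.0) << " "
              << (y == 0.5) << " "
              << (y <  1.0) << " "
              << (y >= 1.0) << std::endl;   // prints: 0 0 0 0
    std::cout << (y != y) << std::endl;     // prints: 1 (the standard NaN test)
}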

A more recent implementation of pow() might include additional checks to handle NaN more efficiently, but this is always a tradeoff -- it would be a shame to have pow() handle 'normal' cases more slowly in order to accelerate NaN handling.

My blog post only applied to individual instructions, not complex functions like pow().

Bruce Dawson