
Hints and allegations abound that arithmetic with NaNs can be 'slow' in hardware FPUs. Is that still true on a modern x64 FPU, e.g. a Nehalem i7? Does the FPU churn out multiplies at the same speed regardless of the values of the operands?

I have some interpolation code that can wander off the edge of our defined data, and I'm trying to determine whether it's faster to check for NaNs (or some other sentinel value) here, there, and everywhere, or just at convenient points.

Yes, I will benchmark my particular case (it could be dominated by something else entirely, like memory bandwidth), but I was surprised not to see a concise summary somewhere to help with my intuition.

I'll be doing this from the CLR, if it makes a difference as to the flavor of NaNs generated.

Sebastian Good
  • As far as I know, there's only one `NaN` value. – zneak Aug 31 '10 at 04:36
  • 3
    @zneak: At the very least, IEEE-754 defines "quiet" and "signaling" NaNs with different bit patterns. – Jim Lewis Aug 31 '10 at 04:51
  • @Jim Lewis I guess that was further than what I knew. – zneak Aug 31 '10 at 04:56
  • zneak: I believe that 32-bit floats have 2^24 different NaN values, but since you check them with `isNan()` and not `==`, it doesn't really matter exactly what bit pattern a given NaN has. In fact, I don't believe the Intel FPUs even generate NaNs; they generate exceptions and exception handlers can return NaNs. – Gabe Aug 31 '10 at 05:06

1 Answer


For what it's worth, using the SSE instruction mulsd with NaN is pretty much exactly as fast as with the constant 4.0 (chosen by a fair dice roll, guaranteed to be random).

This code:

for (unsigned i = 0; i < 2000000000; i++)
{
    double j = doubleValue * i;
}

generates this machine code (inside the loop) with clang (I assume the .NET virtual machine uses SSE instructions when it can too):

movsd     -16(%rbp), %xmm0    ; gets the constant (NaN or 4.0) into xmm0
movl      -20(%rbp), %eax     ; puts i into a register
cvtsi2sdq %rax, %xmm1         ; converts i to a double and puts it in xmm1
mulsd     %xmm0, %xmm1        ; multiplies xmm0 (the constant) with xmm1 (i)
movsd     %xmm1, -32(%rbp)    ; puts the result somewhere on the stack

And with two billion iterations, the NaN version (as defined by the C macro NAN from <math.h>) took about 0.017 seconds less to execute on my i7. The difference was probably caused by the task scheduler.

So to be fair, they're exactly as fast.

zneak
  • It's always nice to see someone's actual profiling results, but the OP asked for a concise summary and particularly not for a benchmarked solution. So -1, sorry. FWIW, I did a benchmark with the VS 2015 cl.exe compiler running inside a Mono runtime (Unity 5.5.2, in fact) and found the isNaN test to be ORDERS of magnitude slower. So just because you found one example where it's fast doesn't answer the question of whether it's generally as fast. – Imi Apr 20 '17 at 06:56
  • @Imi, have you looked at the code that it generated? Also, this answer doesn't have an isNaN anywhere. – zneak Apr 20 '17 at 06:58
  • "this answer doesn't have an isNaN anywhere." -- yes, sorry. I bunched two relatively unrelated points together in my comment. I criticized your answer for providing a specific measurement in response to a question specifically asking for a "concise summary". Then I presented another example (starting, like you, with "FWIW" and one specific measurement only) where IsNaN happened to be slower (which you didn't claim to be faster). So my first point still stands: a broader summary guideline would be nice. My second only backs up your conclusion of "Probably don't use IsNaN instead of direct float operations". – Imi Apr 20 '17 at 10:44