
I am using IEEE strict floating point arithmetic in a CFD solver. My algorithm is explicit and deterministic (it will perform the exact same number of computations at each time step). Yet I am observing that during periods when the solution is... complicated (lots of stuff happening all over the grid), the solver "bogs down" and runs slower.

I'm very n00bish to FP arithmetic, and therefore assumed that FP arithmetic was deterministic in its computation time. Now I'm beginning to wonder whether the number of actual CPU operations required for a given FP calculation can depend on the values (e.g., strongly different exponents for a multiply or divide).

Can IEEE-strict floating-point performance depend on value?

Andreus
  • Short answer: yes. The differences may not always be noticeable, but that probably depends on the hardware being used (if any - if there is no supporting hardware, you will almost certainly notice a difference in timing dependent on value). – Rudy Velthuis Jul 25 '14 at 17:44
  • Typically, non-trivial arithmetic "all over the grid" involves a lot of memory access. The memory access performance can be affected by what else the system is running and the placement in physical memory of different virtual pages. – Patricia Shanahan Jul 25 '14 at 17:49
  • Functions of floating point numbers (log, pow, etc.) can all correctly take different lengths of time for different operands. Even simpler operations dealing with special cases (underflow, NaN) can take different lengths of time. Simplest thing is just to profile the calculation for a couple of "simple" iterations and a couple "complex" iterations and see where the differences are. – Jonathan Dursi Jul 25 '14 at 17:49
  • Arithmetic on subnormal numbers is, on some CPUs, much, much slower than on normal numbers. Infinities and NaNs are special cases too. – tmyklebu Jul 25 '14 at 17:52
  • PatriciaShanahan, I don't think that's the culprit (this time). Unless somehow the placement in physical memory is exactly the same for dozens of different runs, which I doubt. @Rudy, tmyklebu: I would think my CPU/hardware is pretty darn good; it's a 2013 Xeon. What hardware would be better? Jonathan: I'll try that. Thanks, folks! Pretty sure that answers my question. – Andreus Jul 25 '14 at 18:28
  • @Andreus: I assume with "good" you mean "fast". So your hardware may be fast, but there can still be considerable timing differences depending on the numerical values operated on. On the other hand, hardware where it doesn't make much of a difference could very well be slow. One aspect is orthogonal to the other, AFAICT. – Rudy Velthuis Jul 25 '14 at 18:34
  • Would your computations be likely to produce subnormal numbers (that is, very close to zero but non-null numbers) by any chance? – Pascal Cuoq Jul 25 '14 at 18:57
  • I think some profiling and measurement is in order. Perhaps compare otherwise similar runs with fast and slow times? – Patricia Shanahan Jul 25 '14 at 19:42
  • @Andreus: It has nothing to do with "good" or "darn" or "better." Subnormals are a special case that is handled differently from the way normal numbers are handled. They're rare, and CPU vendors usually have something better to do with the silicon than making subnormals fast. – tmyklebu Jul 26 '14 at 00:28
  • If you can give a reproducible test case, we can probably explain the performance characteristics. – Tavian Barnes Jul 26 '14 at 19:46
  • My conclusion thus far: Floating point arithmetic is hard. So far as I can tell, it is not a denormal problem. I have tried every variation of flush-to-zero and treat-as-zero, with and without IEEE strict, ..... Nothing changes this odd behavior. Further, it only seems to happen with the OpenMP multithreaded version of my application. – Andreus Jul 28 '14 at 21:25
  • I cannot simplify this to a reproducible test case (at least not in a reasonable amount of time) as the situation is tied to a large CFD simulation. – Andreus Jul 28 '14 at 21:27
  • Thanks everyone for the pointers; it has led to a lot of learning regarding FP representation, issues, and performance. If I figure out exactly what was causing this, I will post. – Andreus Jul 28 '14 at 21:28
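
For anyone hitting a similar slowdown, a microbenchmark along these lines is a quick way to confirm or rule out the subnormal effect discussed in the comments, before digging into memory or threading effects. This is only a sketch, not code from the thread: it assumes a POSIX system with clock_gettime() (compile with something like gcc -O2 bench.c; the file name, constants, and iteration count are arbitrary choices for illustration), and the point is simply that both calls execute exactly the same number of floating-point operations.

#define _POSIX_C_SOURCE 200112L   /* for clock_gettime */
#include <stdio.h>
#include <time.h>

/* Runs `iters` multiply-adds starting from x, scaling by a factor chosen
 * so the operands stay in roughly the same magnitude range throughout. */
static double time_loop(double x, long iters)
{
    struct timespec t0, t1;
    volatile double acc = 0.0;        /* volatile: keep the loop from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        x *= 1.0000001;               /* grows by only ~e^5 over 50M iterations */
        acc += x;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)acc;
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    const long iters = 50L * 1000 * 1000;
    /* 1.0 stays normal for the whole loop; 1e-310 stays below DBL_MIN
     * (~2.2e-308), i.e. subnormal, for the whole loop. */
    printf("normal operands:    %.3f s\n", time_loop(1.0,    iters));
    printf("subnormal operands: %.3f s\n", time_loop(1e-310, iters));
    return 0;
}

On many x86 CPUs the "subnormal" call runs several times slower than the "normal" one, even though both perform identical numbers of operations. Linking with GCC's -ffast-math (which enables flush-to-zero/denormals-are-zero at program start on x86) typically makes the two timings converge, giving a second, independent check.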

1 Answer


While it would be possible to design floating-point hardware whose execution speed for any particular operation would be independent of the values of the operands, it is generally advantageous to minimize the average-case time, especially if that can be done without affecting the worst-case time. For example, even if a chip would normally require six cycles to perform a double-precision floating-point multiply, performance in many applications could be improved if, at the same time as the chip started the multiplication process, a separate circuit did the following:

Set R1 if first operand is NaN or second operand is +/- 1.0
Set R2 if second operand is NaN or first operand is +/- 1.0
Set Z if either operand is +/- 0.0
Set N if either operand is NaN
If (R1 or R2 or Z)
  Set the body of the result, excluding sign, to (first-op & R1) | (second-op & R2)
  Set the sign of the result to (first-op & (R1 | !N)) ^ (second-op & (R2 | !N))
  Skip the rest of the multiplication

Adding the above logic would cause floating-point multiplies by +/- 1.0 or +/- 0.0 to complete in a sixth of the time required for multiplications not involving such constants. There are many scenarios where code accepts arbitrary scaling factors but is most often used with scaling factors of zero or one; some graphics applications, for example, might allow arbitrary scaling, rotation, and shear but be used most frequently with a scale factor of one, no rotation, and no shear. Expediting multiplication by zero and one, despite requiring less hardware than would be required to improve most multiplications by a cycle, could in many scenarios offer a more useful performance boost.
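
As a rough software analogue of that shortcut, the sketch below checks for the +/- 1.0 and +/- 0.0 cases before falling back to an ordinary multiply. It is purely an illustration of the idea, not a claim about how any real FPU is wired: the function name is invented for the example, the checks are written as branches (hardware would evaluate them in parallel with the multiplier array), and 0 * Inf and 0 * NaN are deliberately left to the general path so they still produce NaN. Compile with something like cc -O2 example.c -lm.

#include <math.h>
#include <stdio.h>

/* mul_with_shortcut: returns a * b, taking an "early out" when one operand
 * is exactly +/-1.0 or +/-0.0, in the spirit of the flags R1/R2/Z above. */
static double mul_with_shortcut(double a, double b)
{
    /* One operand is exactly +/-1.0: copy the other operand, combine the signs. */
    if (fabs(a) == 1.0)
        return signbit(a) ? -b : b;
    if (fabs(b) == 1.0)
        return signbit(b) ? -a : a;

    /* One operand is exactly +/-0.0 and both are finite: the result is a
     * signed zero.  (0 * Inf and 0 * NaN still take the general path so
     * they correctly produce NaN.) */
    if ((a == 0.0 || b == 0.0) && isfinite(a) && isfinite(b))
        return (signbit(a) != signbit(b)) ? -0.0 : 0.0;

    /* General path: the ordinary (notionally multi-cycle) multiply. */
    return a * b;
}

int main(void)
{
    printf("%g\n", mul_with_shortcut(-1.0, 3.5));  /* -3.5 */
    printf("%g\n", mul_with_shortcut(2.5, -0.0));  /* -0 */
    printf("%g\n", mul_with_shortcut(3.0, 7.0));   /* 21 */
    return 0;
}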

supercat