What are the best settings for things like MXCSR? Which rounding mode is fastest, and on which processors? Is it faster to enable signalling NaNs (i.e. unmask the invalid-operation exception) so that I am informed when a computation results in a NaN, or does this slow down computations that don't involve NaNs?
In summary, how do you get the maximum speed out of tight inner SSE loops?
Any related x87 floating-point speed advice is also welcome.
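
For concreteness, here is a minimal sketch of how the MXCSR settings asked about above (rounding mode, flush-to-zero, exception masks) can be changed from C. It assumes an x86 target with SSE and a compiler that ships `<xmmintrin.h>`. The helper names `enter_custom_fp_mode`, `trap_on_invalid` and `leave_custom_fp_mode` are hypothetical and only illustrate the knobs; the sketch makes no claim about which combination is fastest.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr and the _MM_* MXCSR macros */

/* Hypothetical helper: select round-toward-zero and flush-to-zero for SSE
   arithmetic. Returns the previous MXCSR value so it can be restored. */
static unsigned int enter_custom_fp_mode(void)
{
    unsigned int saved = _mm_getcsr();
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    return saved;
}

/* Hypothetical helper: unmask the invalid-operation exception, so an SSE
   operation that would otherwise quietly produce a NaN (e.g. 0.0f / 0.0f)
   raises a floating-point trap (SIGFPE on POSIX systems) instead. */
static void trap_on_invalid(void)
{
    _MM_SET_EXCEPTION_MASK(_MM_GET_EXCEPTION_MASK() & ~_MM_MASK_INVALID);
}

/* Hypothetical helper: restore the MXCSR value saved earlier. */
static void leave_custom_fp_mode(unsigned int saved)
{
    _mm_setcsr(saved);
}
```

The intended usage is to call `enter_custom_fp_mode()` once before the inner loop and `leave_custom_fp_mode(saved)` after it, so the mode change is not paid per iteration and the rest of the program sees the default settings. Note that MXCSR only affects SSE instructions; x87 code is governed by the separate x87 control word.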