How to prevent FTZ for a single line in CUDA

Question

I am working on a particle code in which flush-to-zero (FTZ) is used extensively for performance. However, there is a single floating-point comparison that I do not want flushed. One solution is inline PTX, but it introduces unnecessary instructions because PTX has no boolean type, only predicate registers.

C++ code:

float a, b;
if ( a < b ) do_something();
// compiles into SASS (note the .FTZ modifier):
//     FSETP.LT.FTZ.AND P0, PT, A, B, PT;
// @P0 DO_SOMETHING

PTX:

float a, b;
unsigned int p;
// setp without .ftz keeps subnormal inputs; selp needs an explicit type (.u32)
asm("{.reg .pred p; setp.lt.f32 p, %1, %2; selp.u32 %0, 1, 0, p;}" : "=r"(p) : "f"(a), "f"(b) );
if (p) do_something();
// compiled into SASS:
//     FSETP.LT.AND P0, PT, A, B, PT;
//     SEL R2, RZ, 0x1, !P0;
//     ISETP.NE.AND P0, PT, R2, RZ, PT;
// @P0 DO_SOMETHING

Is there a way that I can do the non-FTZ comparison with a single instruction without coding the entire thing in PTX/SASS?

I would compile the kernel into PTX, modify the regions of interest in the PTX file, and then compile the modified PTX. I have run into a similar problem with the `__ballot()` function. — Farzad, Apr 10 '15 at 16:25
It seems some context is missing. If `a` and `b` are the results of previous computation, and other than for this comparison the entire computation in the kernel is based on FTZ operations, the inputs to `FSETP` are already flushed to zero, in which case making the `FSETP` of non-FTZ kind does not make any difference. — njuffa, Apr 10 '15 at 16:57
@njuffa The reason is sort of complicated here but I guarantee that `a` and `b` are not flushed to zero. They may be very small floating point numbers though. — Rainn, Apr 10 '15 at 18:04
Depending on where that data is coming from (it is not computed inside the kernel, based on your comments), you could also look into pre-scaling this data to avoid subnormal operands to start with, or pre-computing the comparison. I assume you have already established that the couple of additional instructions caused by your current approach using inline PTX do in fact impact application performance? Depending on the numerical range of `a` and `b` you may be able to treat the data as integer (i.e. re-interpret as integers) and use an integer comparison instead of a floating-point comparison. — njuffa, Apr 10 '15 at 18:16
If both `a` and `b` are known to be positive, you can compare using `if (__float_as_int(a) < __float_as_int(b))`, because the ordering of positive floating-point numbers is equivalent to the ordering of their bit pattern interpreted as an integer (with the exception of NaN encodings). Since the GPU stores floating-point and integer data in the same registers, re-interpretation has no cost. — njuffa, Apr 10 '15 at 19:05
@njuffa Thanks for the suggestions! However, for this particular case I think modifying the compiled PTX might be a more viable solution for me. — Rainn, Apr 11 '15 at 03:27
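
For reference, here is a minimal host-side sketch of the integer-comparison idea from njuffa's comment above. It assumes both operands are positive and not NaN. The names `float_bits_as_int` and `less_positive` are hypothetical and only illustrate the idea; in device code, `__float_as_int()` performs the same reinterpretation.

```cpp
#include <cstdint>
#include <cstring>

// Host-side stand-in for CUDA's __float_as_int(): copy the 32 bits of a
// float into a signed integer without any numeric conversion.
static inline std::int32_t float_bits_as_int(float x) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "expects 32-bit float");
    std::int32_t i;
    std::memcpy(&i, &x, sizeof i);
    return i;
}

// Hypothetical helper: evaluates a < b for operands that are known to be
// positive and not NaN. Only an integer comparison is performed, so no
// floating-point flush mode is involved and subnormal inputs keep their order.
static inline bool less_positive(float a, float b) {
    return float_bits_as_int(a) < float_bits_as_int(b);
}
```

In a kernel the equivalent test would be `if (__float_as_int(a) < __float_as_int(b))`, as written in the comment. Whether it actually compiles to a single integer set-predicate instruction is best confirmed by inspecting the generated SASS.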
