In my OpenCL kernel I use 16-bit floating-point values of type half from the cl_khr_fp16 extension.
Although this gives me code that works well, I noticed with AMD's Radeon developer tools that the reciprocal is computed in 32 bits (the GPU target is gfx1102, RDNA3).
So the value is first converted from half precision to single precision, then the reciprocal is computed, and then the result is converted back into half precision.
This happens even though both the numerator and the denominator of the division are in half precision.
I know that CUDA uses a function call for this, hrcp, so I also tried the OpenCL reciprocal functions half_recip() and native_recip(), with the same result.
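To make the three variants concrete, here is a minimal sketch of what they look like in a kernel. The kernel name and buffer names are illustrative placeholders rather than the actual code, and the sketch assumes the device reports cl_khr_fp16:

```opencl
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical reproducer: names are placeholders, not the real kernel.
__kernel void recip_variants(__global const half *in,
                             __global half *out_div,
                             __global half *out_half_recip,
                             __global half *out_native_recip)
{
    const size_t i = get_global_id(0);
    const half x = in[i];

    // Variant 1: plain division, numerator and denominator both half.
    out_div[i] = 1.0h / x;

    // Variant 2: half_recip(), called with a half argument as described above.
    out_half_recip[i] = half_recip(x);

    // Variant 3: native_recip(), called with a half argument as described above.
    out_native_recip[i] = native_recip(x);
}
```

Inspecting the generated ISA for each of the three stores is how the conversion to single precision and back shows up.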
Is there a way to force OpenCL to compute the reciprocal without first converting?