(Follow-up question to "Compile-time ceiling function, for literals, in C?")
Consider the following CUDA function:
__device__ int foo_f() { return ceilf(1007.1111); }
It should be easy to optimize this to produce a device function which simply returns 1008:
mov.u32 %r1, 1008;
st.param.b32 [func_retval0+0], %r1;
ret;
but instead, it compiles (using NVCC 11.5) into the costlier:
mov.f32 %f1, 0f447C0000;
cvt.rzi.s32.f32 %r1, %f1;
st.param.b32 [func_retval0+0], %r1;
ret;
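As an aid to reading the PTX above: PTX writes single-precision literals as `0f` followed by the eight hex digits of the IEEE-754 bit pattern. The sketch below decodes such a pattern on the host; `float_from_bits` is a hypothetical helper name, not part of any CUDA API, and it assumes a 32-bit IEEE-754 `float`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret a 32-bit pattern as a single-precision float.
// Assumes float is a 32-bit IEEE-754 type on the host.
float float_from_bits(std::uint32_t bits)
{
    float value;
    static_assert(sizeof(value) == sizeof(bits), "float must be 32 bits wide");
    std::memcpy(&value, &bits, sizeof(value));  // well-defined type punning
    return value;
}
```

Calling `float_from_bits(0x447C0000u)` yields 1008.0f, so the literal in the `mov.f32` is already the result of `ceilf`; what remains in the PTX is the float-to-int conversion (`cvt.rzi.s32.f32`).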
The optimization is also missed if the code is:
static __device__ int poor_mans_ceilf(float x)
{
    return (int) x + ( ((float)(int) x < x) ? 1 : 0);
}
__device__ int foo_pf() { return poor_mans_ceilf(1007.1111); }
which should be even easier for the compiler to "notice" as an optimization opportunity.
So, why does NVCC fail to make this optimization, when typical C/C++ compilers do make it? Is there some subtle hitch preventing the optimization in (edit) PTX code? I realize that ptxas gets its own chance to optimize this away later, but this is not a microarchitecture-specific optimization.
See it all on GodBolt.
PS: I know that this might be circumvented by using constexpr.
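For reference, a minimal sketch of what such a constexpr workaround might look like. The names `constexpr_ceilf` and `foo_cf` are hypothetical, and the `#ifndef __CUDACC__` shim is only an assumption added so the same snippet also builds with a plain host compiler. I have not checked the PTX this produces; the point is only that initializing a `constexpr` variable obliges the front end to evaluate the call at compile time.

```cpp
// Assumption: let the snippet build without nvcc by defining away the CUDA qualifiers.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical constexpr variant of poor_mans_ceilf (single return, so valid C++11).
constexpr __host__ __device__ int constexpr_ceilf(float x)
{
    return static_cast<int>(x)
         + ((static_cast<float>(static_cast<int>(x)) < x) ? 1 : 0);
}

// The compile-time check fails the build if the folding does not give 1008.
static_assert(constexpr_ceilf(1007.1111f) == 1008, "ceiling should fold to 1008");

__device__ int foo_cf()
{
    // A constexpr variable must be initialized by a constant expression.
    constexpr int result = constexpr_ceilf(1007.1111f);
    return result;
}
```

This sidesteps the question rather than answering it: it relies on the language rules for constant expressions instead of the optimizer noticing the opportunity.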