
(Follow-up question to: Compile-time ceiling function, for literals, in C?)

Considering the following CUDA function:

__device__ int foo_f() { return ceilf(1007.1111); }

It should be easy to optimize this to produce a device function which simply returns 1008:

mov.u32         %r1, 1008;
st.param.b32    [func_retval0+0], %r1;
ret;

but instead, it compiles (using NVCC 11.5) into the costlier:

mov.f32          %f1, 0f447C0000;
cvt.rzi.s32.f32  %r1, %f1;
st.param.b32     [func_retval0+0], %r1;
ret;

The optimization is also missed if the code is:

static __device__ int poor_mans_ceilf(float x)
{
    return (int) x + ( ((float)(int) x < x) ? 1 : 0);
}

__device__ int foo_pf() { return poor_mans_ceilf(1007.1111); }

which should be even easier for the compiler to "notice" as an optimization opportunity.

So, why is NVCC failing to make this optimization (which typical C/C++ compilers do make)? Is there some subtle hitch preventing the optimization in PTX code? I realize that ptxas has its own chance to optimize this away eventually, but this is not a microarchitecture-specific optimization.
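For comparison, here is a minimal host-side sketch of the same expression (my own example, not part of the GodBolt link below); with GCC or Clang at -O2, the ceilf() call is typically constant-folded, so the function just returns the immediate 1008:

#include <math.h>

// Host-side equivalent of foo_f(); a typical host compiler folds the
// ceilf() call at compile time and returns the immediate value 1008.
int foo_host(void) { return (int) ceilf(1007.1111); }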

See it all on GodBolt.

PS: I know that this might be circumvented by using constexpr.
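In case it helps, here is a minimal sketch of that constexpr workaround (the names ceil_to_int and foo_cx are mine, not from the original code); forcing the call through a constexpr variable makes the front end evaluate it at compile time, so the PTX should already contain the integer 1008:

// Illustrative constexpr variant of poor_mans_ceilf().
static __device__ constexpr int ceil_to_int(float x)
{
    return (int) x + (((float)(int) x < x) ? 1 : 0);
}

__device__ int foo_cx()
{
    // constexpr forces compile-time evaluation of the call.
    constexpr int result = ceil_to_int(1007.1111f);
    return result;
}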

einpoklum
    This will be handled by the device code optimizing assembler, `ptxas`, or, alternatively, by the JIT mechanism. Analyzing PTX gives an incomplete picture of what `nvcc` will do, especially when your question pertains to device code. Study the SASS code, then see if you can find any evidence of `0x3f0` there. You might ask yourself at that point why that integer constant is there. You asked specifically about "in device-side code?"; here is what I would say: "PTX is not device-side code". – Robert Crovella Feb 14 '22 at 01:10

1 Answer


As you are no doubt fully aware, PTX is a virtual assembly language and isn't run by the GPU. If we compile your code to machine code, we see this:

$ cat bogogogo.cu
__device__ int foo_f() { return ceilf(1007.1111); }

$ nvcc -dc -Xptxas='-v' bogogogo.cu
ptxas info    : 0 bytes gmem
ptxas info    : Function properties for _Z5foo_fv
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

$ cuobjdump -sass bogogogo.o

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

        code for sm_52
                Function : _Z5foo_fv
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                       /* 0x001ffc00ffe007f0 */
        /*0008*/         {         MOV32I R4, 0x3f0 ;  /* 0x010000003f07f004 */
        /*0010*/                   RET         }
                                                       /* 0xe32000000007000f */
        /*0018*/                   BRA 0x18 ;          /* 0xe2400fffff87000f */
                                                       /* 0x001f8000fc0007e0 */
        /*0028*/                   NOP;                /* 0x50b0000000070f00 */
        /*0030*/                   NOP;                /* 0x50b0000000070f00 */
        /*0038*/                   NOP;                /* 0x50b0000000070f00 */

You can clearly see that the conversion has been optimized to an immediate constant (0x3f0 = 1008 = ceilf(1007.1111)) in the final assembler output. So the optimization you are demanding is performed, just by the PTX assembler (ptxas) rather than by the front-end C++ compiler.

If you want to know why the NVIDIA toolchain designers decided to split the optimization work between the compiler and the assembler in this fashion, you would have to ask them directly.

talonmies
    I suppose I was assuming the split was based on what you can optimize fully without knowing anything hardware-specific, and what you can't. And mine is an example of something you really don't need to wait for ptxas to be able to optimize. – einpoklum Feb 14 '22 at 07:59