
(Follow-up question to: Compile-time ceiling function, for literals, in C?)

Considering the following CUDA function:

__device__ int foo_f() { return ceilf(1007.1111); }

It should be easy to optimize this to produce a device function which simply returns 1008:

mov.u32         %r1, 1008;
st.param.b32    [func_retval0+0], %r1;
ret;

but instead, it compiles (using NVCC 11.5) into the costlier:

mov.f32          %f1, 0f447C0000;
cvt.rzi.s32.f32  %r1, %f1;
st.param.b32     [func_retval0+0], %r1;
ret;

The optimization is also missed if the code is:

static __device__ int poor_mans_ceilf(float x)
{
    return (int) x + ( ((float)(int) x < x) ? 1 : 0);
}

__device__ int foo_pf() { return poor_mans_ceilf(1007.1111); }

which should be even easier for the compiler to "notice" as an optimization opportunity.

So, why is NVCC failing to make this optimization (which typical C/C++ compilers do make)? Is there some subtle hitch preventing the optimization in PTX code? I realize that ptxas has its own chance to optimize this away eventually, but this is not a microarchitecture-specific optimization.
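For comparison, here is a minimal host-side sketch of the same expression (my own example, not part of the GodBolt link below); with GCC or Clang at -O2, the ceilf() call is typically constant-folded, so the function just returns the immediate 1008:

#include <math.h>

// Host-side equivalent of foo_f(); a typical host compiler folds the
// ceilf() call at compile time and returns the immediate value 1008.
int foo_host(void) { return (int) ceilf(1007.1111); }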

See it all on GodBolt.

PS: I know that this might be circumvented by using constexpr.
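In case it helps, here is a minimal sketch of that constexpr workaround (the names ceil_to_int and foo_cx are mine, not from the original code); forcing the call through a constexpr variable makes the front end evaluate it at compile time, so the PTX should already contain the integer 1008:

// Illustrative constexpr variant of poor_mans_ceilf().
static __device__ constexpr int ceil_to_int(float x)
{
    return (int) x + (((float)(int) x < x) ? 1 : 0);
}

__device__ int foo_cx()
{
    // constexpr forces compile-time evaluation of the call.
    constexpr int result = ceil_to_int(1007.1111f);
    return result;
}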

einpoklum
    This will be handled by the device code optimizing assembler, `ptxas`, or, alternatively, by the JIT mechanism. Analyzing PTX gives an incomplete picture of what `nvcc` will do, especially when your question pertains to device code. Study the SASS code, then see if you can find any evidence of `0x3f0` there. You might ask yourself at that point why that integer constant is there. You asked specifically about "in device-side code?"; here is what I would say: "PTX is not device-side code". – Robert Crovella Feb 14 '22 at 01:10

1 Answer


As you are no doubt fully aware, PTX is a virtual assembly language and isn't run by the GPU. If we compile your code to machine code, we see this:

$ cat bogogogo.cu
__device__ int foo_f() { return ceilf(1007.1111); }

$ nvcc -dc -Xptxas='-v' bogogogo.cu
ptxas info    : 0 bytes gmem
ptxas info    : Function properties for _Z5foo_fv
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

$ cuobjdump -sass bogogogo.o

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

        code for sm_52
                Function : _Z5foo_fv
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                       /* 0x001ffc00ffe007f0 */
        /*0008*/         {         MOV32I R4, 0x3f0 ;  /* 0x010000003f07f004 */
        /*0010*/                   RET         }
                                                       /* 0xe32000000007000f */
        /*0018*/                   BRA 0x18 ;          /* 0xe2400fffff87000f */
                                                       /* 0x001f8000fc0007e0 */
        /*0028*/                   NOP;                /* 0x50b0000000070f00 */
        /*0030*/                   NOP;                /* 0x50b0000000070f00 */
        /*0038*/                   NOP;                /* 0x50b0000000070f00 */

You can clearly see that the conversion has been optimized to an immediate constant (0x3f0 = 1008 = ceilf(1007.1111)) in the final assembler output. So the optimization you are demanding is performed, just by the PTX assembler (ptxas) rather than by the front-end C++ compiler.

If you want to know why the NVIDIA toolchain designers decided to split the optimization work between the compiler and the assembler in this fashion, you would have to ask them directly.

talonmies
    I suppose I was assuming the split was based on what you can optimize fully without knowing anything hardware-specific, and what you can't. And mine is an example of something you really don't need to wait for ptxas to be able to optimize. – einpoklum Feb 14 '22 at 07:59