I'm experimenting with using an nvcc-generated PTX file in an OpenCL-based host program. A simple test kernel, written in both CUDA C and OpenCL C, is shown below.

// exampleKernel_cuda.cu
extern "C"
{
    __global__ void exampleKernel(float *input, float *output)
    {
        // Get the thread id
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        output[tid] = input[tid];
    }
}
// exampleKernel_opencl.cl
__kernel void exampleKernel(__global float *input, __global float *output) 
{
    int tid = get_global_id(0);
    output[tid] = input[tid];
}

The CUDA version is then compiled to PTX with

nvcc -arch=sm_50 -ptx exampleKernel_cuda.cu

The PTX for the OpenCL C version is retrieved from a separate program using getInfo<CL_PROGRAM_BINARIES>.

The host program then creates a cl::Program object from each PTX file and builds it. Device buffers are allocated with cl::Buffer and copied with enqueueReadBuffer/enqueueWriteBuffer.

The result: the OpenCL C kernel works as expected, but the CUDA C kernel produces empty output. Adding printf() calls to the CUDA C kernel shows that it launches successfully, so the problem appears to lie with the argument pointers.

Suspecting the kernel arguments, I looked at the PTX code and found that the two versions differ in whether the kernel arguments are declared with .ptr .global.

// PTX for CUDA C
.visible .entry exampleKernel(
    .param .u64 exampleKernel_param_0,
    .param .u64 exampleKernel_param_1
)
// PTX for OpenCL C
.entry exampleKernel(
    .param .u64 .ptr .global .align 4 exampleKernel_param_0,
    .param .u64 .ptr .global .align 4 exampleKernel_param_1
)

I manually added .ptr .global to the CUDA C kernel arguments, and the kernel then works as expected.
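For reference, that manual edit can be automated with a small post-processing script. The sketch below (Python, my own helper, not part of any NVIDIA or OpenCL toolchain) assumes nvcc emits one plain .param .u64 declaration per line, as in the listing above, and leaves already-annotated parameters untouched:

```python
import re

def patch_ptx_params(ptx: str, align: int = 4) -> str:
    """Rewrite plain '.param .u64 <name>' kernel parameters as
    '.param .u64 .ptr .global .align <align> <name>'.

    The negative lookahead skips parameters that already carry a
    .ptr annotation, so running the patch twice is a no-op."""
    pattern = re.compile(r"(\.param\s+\.u64)\s+(?!\.ptr)(\w+)")
    return pattern.sub(rf"\1 .ptr .global .align {align} \2", ptx)

ptx = """.visible .entry exampleKernel(
    .param .u64 exampleKernel_param_0,
    .param .u64 exampleKernel_param_1
)"""
print(patch_ptx_params(ptx))
```

Note this is only a sketch: 32-bit builds emit .u64 differently sized parameters, and a real pass would need to restrict itself to pointer-typed kernel arguments rather than every .u64 parameter.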

With all that said: is there a way to compile the CUDA kernel into a similar pattern, where at the PTX level the kernel arguments point to the global address space, so that the nvcc-generated PTX can be used in OpenCL?

If not, I assume CUDA and OpenCL allocate device buffers differently, making the kernel arguments at the PTX level refer to different memory spaces. Is my assumption correct, and is there any documentation about this?

MCx
  • Read https://stackoverflow.com/q/67620646/681865 -- there is no intentional interoperability between PTX for OpenCL runtimes and PTX for CUDA runtimes, and there is no documentation for this unintentional interoperability to the extent it actually works. – talonmies Sep 03 '23 at 03:34

0 Answers