
I'm writing a hobby raytracer in CUDA and C++ and I'm running into an issue that I haven't been able to get an answer on. I have written CPU and GPU code such that it can execute on machines with or without CUDA-capable devices. However, this has led to some code duplication in the following sense: a small set of functions require random number generation, which is achieved with stdlib on host and curand on device. I would love to have __host__ __device__ functions that take a Sampler struct that either calls rand() on host or curand_uniform() on device. I've tried some things but can't get the program to compile - the compiler complains that __device__ functions cannot be called from __host__ code and vice versa.

Ideally I'd like my rendering functions to take a Sampler * which looks something like the code below.

Thanks!

struct Sampler {
    __host__ virtual float getNextFloat() { return rand() / (RAND_MAX + 1.f); }
};

struct CudaSampler : Sampler { 
    curandState* p_curandState;
    __device__ float getNextFloat() { return curand_uniform(p_curandState); }
};
Tudor

1 Answer


What you're asking for should be possible. We don't want to attempt to overload a single function separately for __host__ and __device__ (that isn't allowed), and we don't want to attempt this via inheritance and virtual functions (the virtual function table of an object created on the host is not usable when that object is passed to the device).

But if we avoid those issues, the basic idea is to use the __CUDA_ARCH__ macro, which nvcc defines only during device compilation, to differentiate the host and device paths for the compiler, generally following what is suggested here.

Here's one possible method, roughly following your outline:

$ cat t34.cu
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <curand_kernel.h>

struct Sampler {
  // A single functor usable on both host and device. nvcc defines
  // __CUDA_ARCH__ only during device compilation, so each compilation
  // pass sees exactly one branch.
  __host__ __device__ float operator()(curandState *s){
#ifdef __CUDA_ARCH__
    return curand_uniform(s);
#else
    return rand()/(float)RAND_MAX;
#endif
  }
};

// Initialize one curand generator state per thread.
__global__ void init_rng(curandState *state, size_t n){
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n)
    curand_init(1234, idx, 0, state+idx);
}

__global__ void gpu_sample(Sampler s, curandState *state, size_t n){
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n)
    printf("gpu id: %lu, val: %f\n", idx, s(state+idx));
}

__host__ void cpu_sample(Sampler s){
  curandState dummy;  // never read on the host path; keeps the call signature uniform
  std::cout << "cpu: " << s(&dummy) << std::endl;
}

int main(){
  int n = 1;
  int nTPB = 256;
  curandState *s;
  Sampler my_op;
  cudaMalloc(&s, n*sizeof(curandState));
  init_rng<<<(n+nTPB-1)/nTPB, nTPB>>>(s,n);
  gpu_sample<<<(n+nTPB-1)/nTPB, nTPB>>>(my_op, s, n);
  cudaDeviceSynchronize();
  cpu_sample(my_op);
}



$ nvcc -o t34 t34.cu
$ cuda-memcheck ./t34
========= CUDA-MEMCHECK
gpu id: 0, val: 0.145468
cpu: 0.840188
========= ERROR SUMMARY: 0 errors
$
Robert Crovella