Can't use my template class in cuda kernel

Question

I thought I knew how to write some clean cuda code. Until I tried to make a simple template class and use it in a simple kernel. I've been trouble shooting for days. Every single thread I've visited made me feel a little more stupid.

For error checking I used this

Here is my class.h:

#pragma once
template <typename T>
class MyArray
{
public:
    const int size;

    T *data;

    __host__ MyArray(int size); //gpuErrchk(cudaMalloc(&data, size * sizeof(T)));

    __device__ __host__ T GetValue(int); //return data[i]
    __device__ __host__ void SetValue(T, int); //data[i] = val;
    __device__ __host__ T& operator()(int); //return data[i];

    ~MyArray(); //gpuErrchk(cudaFree(data));
};

template class MyArray<double>;

The relevant content of class.cu is in the comments. If you think the whole thing is relevant Id be happy to add it.

Now for the main class:

__global__ void test(MyArray<double> array, double *data, int size)
{
    int j = threadIdx.x;
        //array.SetValue(1, j);  //doesn't work
        //array(j) = 1;  //doesn't work
        //array.data[j] = 1; //doesn't work
        data[j] = 1;   //This does work !
        printf("Reach this code\n");
    }
}
int main(int argc, char **argv)
{
    MyArray x(20);
    test<<<1, 20>>>(x, x.data, 20);

    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
}

When I say "doesn't work", I mean that the program stops there (before reaching the printf) without outputting any error. Plus I get the following error both from cudaDeviceSynchronize and from cudaFree:

an illegal memory access was encountered

What I can't understand is that there should be no issue with memory management since sending the array directly to the kernel works fine. So why doesn't it work when I send a class and try to access the classes data? And why do I receive no warning or error message when clearly my code bumped into some error?

Here is the output of nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

I can't test it right now but I would suggest not taking MyArray as reference but by value instead. See this question: https://stackoverflow.com/questions/8302506/parameters-to-cuda-kernels — Devon Cornwall, Nov 29 '19 at 10:10
`x` is in host memory, and you can't access an object in host memory from the GPU. Note that the `data` member can't be used on the host since the underlying memory is allocated on the device. I would recommend some more reading about host and device memory. — molbdnilo, Nov 29 '19 at 10:22
You need to create x on host assigning T* data on GPU in constructor or another function. Then copy x from host to GPU using cudaMemcpy() and then send it to the kernel for it to work. The error you are getting has nothing to do with the class being a template. — Saisai3396, Nov 29 '19 at 10:50
@DevonCornwall Thanks, I changed to pass by value. It doesn't work yet but at least I know its not that the trouble is elsewhere. — Cydouzo, Dec 02 '19 at 05:28
@molbdnilo Accoarding to the official CUDA documentation, there should be no need to copy x in device memory explicitely, and passing it by value is enough (see my edit for further explanation) — Cydouzo, Dec 02 '19 at 05:28
I think you're misunderstanding the documentation; your object needs to reside in device memory in order to be used in a device kernel. Refer to this post for a working example of passing an array of objects to a kernel: https://devtalk.nvidia.com/default/topic/1037242/pass-pointer-to-class-as-a-kernel-argument-and-access-class-methods/ — Michael, Dec 04 '19 at 21:33
@Michael Thank you, that post had the exact information that I was missing! — Cydouzo, Dec 05 '19 at 05:52

score 2 · Accepted Answer · edited Aug 02 '20 at 20:02

(Editorial note: there is quite a bit of disinformation in the comments on this question, so I have assembled an answer as a community wiki entry.)

There is no particular reason why a template class cannot be passed as an argument to a kernel. There are some limitations which need to be clearly understood before doing so:

CUDA kernel arguments, for all intent and purpose, always passed by value. Pass by reference is supported under extremely limited set of circumstances (the argument in question must be stored in managed memory). This does not apply here.
As a result of (1), POD arguments just work, because they are trivially copyable and rely on no special behaviour
Classes are different, in that when you pass a class by value, you are implicitly invoking copy construction or move construction semantics. That means that classes passed by value as kernel arguments must be trivially copy constructable. There is no way to run non trivial copy constructors on the device as part of a kernel launch.
CUDA further requires that classes don't contain virtual members
Although the <<< >>> kernel launch syntax looks like a simple function call, it isn't. There is several layers of abstraction boilerplate and a API call between what you write in host code and what actually is emitted by the toolchain on the host side. This means that there are several copy construction operations between your code and the GPU. If you do something like put a cudaFree call in your destructor, you should assume that it will get called as part of the function call sequence which launches a kernel when one of those copies falls out of scope. You do not want that.

You did not show how the class member functions were actually implemented in this case, so saying why one of the many permutations your code comments hinted at did or did not work is impossible, beyond passing the raw pointer to the kernel, which works because it is a trivially copyable POD value, when the class was almost certainly not.

Here is a simple, complete example showing how to make this work:

$cat classy.cu
#include <vector>
#include <iostream>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

template <typename T>
class MyArray
{
    public:
        int len;
        T *data;

        __device__ __host__ void SetValue(T val, int i) { data[i] = val; };
        __device__ __host__ int size() { return sizeof(T) * len; };

        __host__ void DevAlloc(int N) {
            len = N;
            gpuErrchk(cudaMalloc(&data, size()));
        };

        __host__ void DevFree() {
            gpuErrchk(cudaFree(data));
            len = -1;
        };
};

__global__ void test(MyArray<double> array, double val)
{
    int j = threadIdx.x;
    if (j < array.len)
        array.SetValue(val, j);
}

int main(int argc, char **argv)
{
    const int N = 20;
    const double val = 5432.1;

    gpuErrchk(cudaSetDevice(0));
    gpuErrchk(cudaFree(0));

    MyArray<double> x;
    x.DevAlloc(N);

    test<<<1, 32>>>(x, val);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    std::vector<double> y(N);
    gpuErrchk(cudaMemcpy(&y[0], x.data, x.size(), cudaMemcpyDeviceToHost));
    x.DevFree();

    for(int i=0; i<N; ++i) std::cout << i << " = " << y[i] << std::endl;

    return 0;
}

which compiles and runs like so:

$ nvcc -std=c++11 -arch=sm_53 -o classy classy.cu
$ cuda-memcheck ./classy
========= CUDA-MEMCHECK
0 = 5432.1
1 = 5432.1
2 = 5432.1
3 = 5432.1
4 = 5432.1
5 = 5432.1
6 = 5432.1
7 = 5432.1
8 = 5432.1
9 = 5432.1
10 = 5432.1
11 = 5432.1
12 = 5432.1
13 = 5432.1
14 = 5432.1
15 = 5432.1
16 = 5432.1
17 = 5432.1
18 = 5432.1
19 = 5432.1
========= ERROR SUMMARY: 0 errors

(CUDA 10.2/gcc 7.5 on a Jetson Nano)

Note that I have included host side functions for allocation and deallocation which do not interact with the constructor and destructor. Otherwise the class is extremely similar to your design and has the same properties.

Can't use my template class in cuda kernel

1 Answers1