
I follow some guidelines to deal with memory management in C++. Some examples: I never use malloc. I almost never need or use new or delete. I use smart pointers, and almost never need to write destructors.

I want to learn CUDA. I have been looking online for tutorials that match my C++ style of programming, but everything looks C-style. It is not clear to me when this C style of programming is necessary and when it is just the style of the author. As an example, here is a snippet of code from an NVIDIA tutorial:

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

//...

  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}

This code uses malloc, free, owning raw pointers, and C-style arrays. Are these all necessary? Can I write modern C++-style CUDA?

user589321
  • CUDA uses a C interface, so you are basically stuck writing a lot of C-like code when working with it. – NathanOliver Apr 05 '22 at 19:37
  • You may write your own CUDA smart pointer classes to wrap that stuff. And maybe overriding `new` and `delete` operators for those special CUDA memory pointers would help. – πάντα ῥεῖ Apr 05 '22 at 19:41
  • C++ smart pointers can have deleters, and deleters can call cudaFree. – Öö Tiib Apr 05 '22 at 19:42
  • Note that some libraries like thrust provide a C++ abstraction on top of CUDA, including device-based vectors. The interaction with CUDA kernels is quite good, but be aware that the performance of thrust primitives is not great (this is improving). – Jérôme Richard Apr 05 '22 at 21:12

1 Answer


CUDA started out (over a decade ago) as a largely C-style entity. Over time, the language migrated to be primarily a C++ variant/definition. For clarity, we should delineate the discussion between device code and host code.

For device code, CUDA claims compliance with a particular C++ standard, subject to various restrictions. One of the particular restrictions is that there is no general support for the standard library.

For device code (with some overlap into host code), there is an evolution underway to provide a set of STL-like libraries/features, such as libcu++. But, as an example, std::vector is not usable in CUDA device code (although you can use new in CUDA device code).
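For illustration, this kind of device-side heap allocation is legal (a minimal sketch; the kernel is made up for this example, and the device heap it draws from is limited by cudaLimitMallocHeapSize):

__global__ void scratch_kernel()
{
  // per-thread allocation from the device heap; legal where std::vector is not
  float *buf = new float[16];
  if (buf != nullptr) {  // the device heap can be exhausted, so check
    buf[0] = 1.0f;
    delete [] buf;
  }
}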

For host code, there really isn't anything that is intended to be out-of-bounds, as long as we are talking about things that are strictly host code. The exceptions to this are undocumented issues that crop up from time to time, for example with boost and perhaps other libraries. These aren't intentional omissions, but arise from the fact that CUDA uses a special preprocessor/front-end, even for host code, coupled with incomplete testing against every imaginable library one might want to use.

Regarding user-supplied libraries (as opposed to standard libraries or system libraries), it's also worth noting that CUDA generally requires functions to be decorated appropriately (e.g. with __host__ __device__) in order to be usable in device code. Whether we are talking about compiled libraries or header-only libraries, these should generally be usable in host code (subject to the caveat above), but not necessarily in device code, unless the library has been specifically decorated for CUDA usage.
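For example, a small function decorated so it can be compiled for both sides (the function and kernel here are illustrative):

__host__ __device__ float axpy_elem(float a, float x, float y)
{
  return a * x + y;  // callable from host code and from kernels
}

__global__ void axpy_kernel(int n, float a, const float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = axpy_elem(a, x[i], y[i]);
}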

Where host code is interfacing with device code, you'll need to follow the limitations fairly closely. Again, a std::vector container cannot easily be passed to a device code function call (a CUDA kernel). But as already mentioned in the comments, there is something similar you can do with the thrust library, which is included with the CUDA toolkit install.

Are these all necessary?

malloc and free are not necessary. You can similarly use new and delete, or use the thrust containers.
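Building on the deleter idea from the comments, device allocations can be owned by std::unique_ptr (a minimal sketch; cuda_free_deleter and make_cuda_unique are names made up for this example, and error checking is omitted):

#include <memory>
#include <cuda_runtime.h>

struct cuda_free_deleter {
  void operator()(float *p) const { cudaFree(p); }  // runs when the owner is destroyed
};

std::unique_ptr<float[], cuda_free_deleter> make_cuda_unique(size_t n)
{
  float *p = nullptr;
  cudaMalloc(&p, n * sizeof(float));  // error checking omitted for brevity
  return std::unique_ptr<float[], cuda_free_deleter>(p);
}

// usage: auto d_x = make_cuda_unique(N);
//        saxpy<<<blocks, threads>>>(N, 2.0f, d_x.get(), d_y.get());

With this, the explicit cudaFree calls from the question's snippet disappear; the pointers clean up after themselves like any other smart pointer.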

Regarding raw pointers and, relatedly, C-style arrays: these will probably be more or less unavoidable, as they are part of C++, and there are no higher-level containers in C++ apart from what is in the standard libraries, AFAIK. Use of raw pointers, at least at the host-device interface, is certainly typical. If you use thrust::device_vector, for example, you will still need to extract a raw pointer to pass to the kernel, as sketched below.
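A sketch of that pattern, reusing the saxpy shape from the question's snippet (error checking omitted):

#include <thrust/device_vector.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
  int N = 1 << 20;
  thrust::device_vector<float> d_x(N, 1.0f);  // allocation owned by the container
  thrust::device_vector<float> d_y(N, 2.0f);

  // the kernel boundary still wants raw pointers:
  saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f,
      thrust::raw_pointer_cast(d_x.data()),
      thrust::raw_pointer_cast(d_y.data()));
  cudaDeviceSynchronize();
}  // d_x and d_y release their device memory here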

The CUDA runtime and driver APIs still have a largely C-style feel to them. It's not formally part of CUDA, but others have created wrappers to make things more "C++-like". One such example is this library from einpoklum/eyalroz. I have no personal experience with it, but the maintenance of it seems relatively energetic, a going concern. And, as hinted in the comments, via C++ overloads and replaceable functionality in various containers and library constructs, you can probably build a container or construct that does what you want, perhaps by replacing standard allocators, etc.
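As one sketch of the replace-the-allocator idea (managed_allocator is a name made up here; error checking, and the device synchronization needed before touching the data on the host after a kernel, are omitted):

#include <vector>
#include <cstddef>
#include <cuda_runtime.h>

template <class T>
struct managed_allocator {
  using value_type = T;
  managed_allocator() = default;
  template <class U> managed_allocator(const managed_allocator<U> &) {}
  T *allocate(std::size_t n) {
    void *p = nullptr;
    cudaMallocManaged(&p, n * sizeof(T));  // unified memory, visible to host and device
    return static_cast<T *>(p);
  }
  void deallocate(T *p, std::size_t) { cudaFree(p); }
};

template <class T, class U>
bool operator==(const managed_allocator<T> &, const managed_allocator<U> &) { return true; }
template <class T, class U>
bool operator!=(const managed_allocator<T> &, const managed_allocator<U> &) { return false; }

// an ordinary std::vector whose storage a kernel can reach via v.data()
std::vector<float, managed_allocator<float>> v(1 << 20);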

As already mentioned, thrust intends to provide a container/algorithm approach to leverage those kinds of C++ concepts in a CUDA environment.
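For example, the saxpy from the question can be written entirely with thrust containers and algorithms, with no visible pointers or explicit frees (a sketch using thrust placeholder expressions):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main()
{
  using namespace thrust::placeholders;
  int N = 1 << 20;
  thrust::device_vector<float> x(N, 1.0f);
  thrust::device_vector<float> y(N, 2.0f);
  // y = 2*x + y, computed on the device
  thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), 2.0f * _1 + _2);
}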

It's not part of CUDA, but NVIDIA also offers a way to accelerate standard C++ code.
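For instance (a sketch; this assumes the NVIDIA HPC SDK's nvc++ compiler, built with something like nvc++ -stdpar=gpu, which can offload standard parallel algorithms to the GPU):

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
  int N = 1 << 20;
  std::vector<float> x(N, 1.0f), y(N, 2.0f);
  float a = 2.0f;
  // ordinary ISO C++; no CUDA decorations anywhere
  std::transform(std::execution::par_unseq,
                 x.begin(), x.end(), y.begin(), y.begin(),
                 [a](float xi, float yi) { return a * xi + yi; });
}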

Robert Crovella
  • I'd be interested in a `cuda::valarray`. I can see why `std::valarray` stumbled, but GPUs look like the kind of processor where it makes sense. – MSalters Apr 05 '22 at 22:22
  • A way of avoiding raw pointers at the interface is using a `span` implementation that provides the right function decorations, i.e. [gsl-lite](https://github.com/gsl-lite/gsl-lite). – paleonix Apr 06 '22 at 14:31
  • And it might be worth noting that `nvc++` promises not to need the decorators anymore, which should in principle allow using many more libraries in GPU code. – paleonix Apr 06 '22 at 14:34