What's the 'right' way to implement a 32-bit memset for CUDA?

Question

CUDA has the API call

cudaError_t cudaMemset (void *devPtr, int value, size_t count)

which fills a buffer with a single-byte value. I want to fill it with a multi-byte value. Suppose, for the sake of simplicity, that I want to fill devPtr with a 32-bit (4-byte) value, and suppose we can ignore endianness. Now, the CUDA driver has the following API call:

CUresult cuMemsetD32(CUdeviceptr dstDevice, unsigned int ui, size_t N)

So is it enough for me to just: obtain the CUdeviceptr from the device-memory-space pointer, then make the driver API call? Or is there something else I need to be doing?

talonmies · Accepted Answer · 2014-03-18T16:57:05.253

As of about CUDA 3.0, runtime API device pointers (and everything else) are interoperable with the driver API. So yes, you can use cuMemsetD32 to fill a runtime API allocation with a 32-bit value. The size of CUdeviceptr will match the size of void * on your platform, and it is safe to cast a pointer from the CUDA API to CUdeviceptr or vice versa.

edited Mar 18 '14 at 16:57

answered Mar 18 '14 at 15:46

talonmies


But a `CUdeviceptr` is an unsigned int, isn't it? Can I just cast it to a `void*`? – einpoklum Mar 18 '14 at 16:30
Yes, on 32-bit operating systems CUdeviceptr is an unsigned int (unsigned long long on 64-bit systems), but you can cast it to void* or whatever type your array is. – kunzmi Mar 18 '14 at 16:39
@einpoklum: See my edit. You can read more (and see me get schooled by one of the original CUDA developers) [here](http://blog.cudahandbook.com/2013/08/12/why-does-cuda-cudeviceptr-use-unsigned-int-instead-of-void.aspx). – talonmies Mar 18 '14 at 17:02
Even after reading the discussion here and the blog post, it's not clear to me how the cast from unsigned int (32 bit) could possibly work on the latest NVIDIA GPUs, which have up to 10GB of on-board memory. – Wenzel Jakob Feb 14 '19 at 15:08

score 0 · Answer 2 · edited May 23 '17 at 12:15

Based on talonmies' answer, it seems a reasonable (though ugly) approach would be:

#include <cstdint>
#include <cstring>
#include <cuda.h>

// Primary template: declared only, specialised per element type below.
// count is a number of elements, not bytes. The driver calls return a
// CUresult, so that is what is returned here rather than a cudaError_t.
template <typename T>
inline CUresult cudaMemsetTyped(void *devPtr, T value, size_t count);

#define INSTANTIATE_CUDA_MEMSET_TYPED(_nbits) \
template <> \
inline CUresult cudaMemsetTyped<int ## _nbits ## _t>(void *devPtr, int ## _nbits ## _t value, size_t count) { \
    return cuMemsetD ## _nbits(reinterpret_cast<CUdeviceptr>(devPtr), static_cast<uint ## _nbits ## _t>(value), count); \
} \
template <> \
inline CUresult cudaMemsetTyped<uint ## _nbits ## _t>(void *devPtr, uint ## _nbits ## _t value, size_t count) { \
    return cuMemsetD ## _nbits(reinterpret_cast<CUdeviceptr>(devPtr), value, count); \
}

INSTANTIATE_CUDA_MEMSET_TYPED(8)
INSTANTIATE_CUDA_MEMSET_TYPED(16)
INSTANTIATE_CUDA_MEMSET_TYPED(32)

#undef INSTANTIATE_CUDA_MEMSET_TYPED

template <>
inline CUresult cudaMemsetTyped<float>(void *devPtr, float value, size_t count) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy the float's bit pattern; a cast would convert the value
    return cuMemsetD32(reinterpret_cast<CUdeviceptr>(devPtr), bits, count);
}

(There seems to be no cuMemsetD64, so no double either.)
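
One detail of the float case deserves a note: cuMemsetD32 takes an unsigned int, so the float has to be passed as its raw bit pattern. A reinterpret_cast from float to an integer type does not compile, and a static_cast would convert the numeric value (1.0f would become 1). A minimal host-only sketch of the bit copy follows; float_bits is a hypothetical helper name, not part of CUDA, and the sketch assumes a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a CUDA function): returns the raw 32-bit
// pattern of a float. Assumes float is a 32-bit IEEE 754 type.
inline std::uint32_t float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "float must be 32 bits wide");
    std::uint32_t bits;
    // memcpy copies the object representation without converting the value.
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

The returned value is what would be handed to cuMemsetD32 as its second argument.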


answered Mar 18 '14 at 17:33

einpoklum


To be honest, you will probably find it easier and more performant to do something like [this](http://stackoverflow.com/a/10599189/681865) if you want to do 64 bit or larger types or a generic template solution – talonmies Mar 18 '14 at 19:27
Yeah, for 64-bit values I suppose I'd need a kernel (unless the hardware supports strided writes). But for up to 32 bits, the driver call should be much faster. – einpoklum Mar 19 '14 at 20:08
The driver call just launches a kernel; in some circumstances it certainly used to be possible to outperform cudaMemset with a customised memset kernel. You might want to try benchmarking and see. – talonmies Mar 20 '14 at 06:13
@talonmies: Surely you jest... are you telling me I can't just turn off the power to some of the DRAM and zero it? That I have to actually write 0's everywhere? I find that somewhat hard to believe. – einpoklum Mar 20 '14 at 21:47
I have no idea what you are asking. My point is that device memset and device-to-device memcpy are implemented as kernels on the GPU, and, depending on your use case and data type, it is possible to write custom code which will perform as well as or even better than the generic code the driver launches. – talonmies Mar 24 '14 at 11:52
