
How do I initialize device array which is allocated using cudaMalloc()?

I tried cudaMemset, but it fails to initialize the values to anything other than 0. My call to cudaMemset looks like the one below, where value is initialized to 5.

cudaMemset(devPtr,value,number_bytes)
CharlesB
username_4567
  • Could you provide your code where you call `cudaMemset`? – geek May 14 '12 at 20:31
  • You understand that the value in `cudaMemset` is a *byte* value, not a word value, i.e. the same as in the C standard library `memset`? – talonmies May 15 '12 at 09:16

2 Answers


As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:

cudaError_t cudaMemset  (   void *      devPtr,
                            int         value,
                            size_t      count    
                        )           

Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.

So value is a byte value. If you do something like:

int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);

what you are asking to happen is that each byte of devPtr will be set to 5. If devPtr were an array of integers, the result would be that each integer word would have the value 84215045 (each of its four bytes set to 0x05). This is probably not what you had in mind.

Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as

template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}

(standard disclaimer: written in browser, never compiled, never tested, use at own risk).

Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
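As an illustration only (browser-written and untested, in the same spirit as the kernel above; the array length N, the block size of 256 and the fill value 1.0f are arbitrary choices, not from the answer), a launch might look like:

```cuda
float *devPtr;
const size_t N = 2333;
cudaMalloc((void **)&devPtr, N * sizeof(float));

// One thread per word is enough; because of the grid-stride loop in
// the kernel, a smaller (capped) grid would also cover all N words.
const int blockSize = 256;
const int gridSize = (int)((N + blockSize - 1) / blockSize);
initKernel<float><<<gridSize, blockSize>>>(devPtr, 1.0f, N);
```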

Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing, but for half and full 32 bit word types. If you need to set 64 bit or larger types (so doubles or vector types), your best option is to use your own kernel.
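For example, a driver API sketch (untested; it assumes a current CUDA context, and the allocation size N is my own illustrative choice, not from the answer) could be:

```cuda
// Sets each 32-bit word of d_a to 5. Note the last argument of
// cuMemsetD32 is a count of 32-bit words, not bytes as in cudaMemset.
CUdeviceptr d_a;
size_t N = 1024;
cuMemAlloc(&d_a, N * sizeof(unsigned int));
cuMemsetD32(d_a, 5, N);
```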

talonmies
  • Currently, in my code I'm doing the same thing, but I wanted a way with only cudaMemset. – username_4567 May 15 '12 at 14:12
  • @user997704: There isn't a way with `cudaMemset`. Either run a kernel of your own, or use `cuMemsetD16`/`cuMemsetD32` from the driver API – talonmies May 16 '12 at 09:30
  • It would be nice, for self-containment, to know a sensible grid and block size to invoke such a kernel with, e.g. if my array is N=2333, how would I call this kernel? – SkyWalker Jan 06 '14 at 15:13
  • I actually don't really understand the stride loop. Why do you need a loop at all? Shouldn't each thread set one element and be done with it? – SkyWalker Jan 23 '14 at 14:40
  • @GiovanniAzua I believe the stride is there to have one thread initialize more than one element in the array. Sometimes it is better to give more work to a thread than to launch a lot of threads. – BRabbit27 Aug 06 '14 at 07:43

I also needed a solution to this question, and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid blocks with for(; tidx < nwords; tidx += stride), nor, for that matter, the kernel invocation and the use of the counter-intuitive word sizes.

Therefore I created a much simpler, monolithic generic kernel and customized it with strides, i.e. you may use it to initialize a matrix in multiple ways, e.g. set rows or columns to any value:

template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid * incx < n) {
        a[tid * incx] = value;
    }
}

Then you may invoke the kernel like this:

template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    // BLOCK_SIZE is assumed to be defined elsewhere, e.g. 256
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T><<<gridDim, blockDim>>>(a, value, n, incx);
}
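As a hypothetical usage example (the names d_A, m and n are my assumptions, not part of the answer): in a column-major m × n matrix, consecutive elements of a row are m apart in memory, so a stride of incx = m sets a whole row:

```cuda
// Hypothetical: set every element of row 0 of a column-major m x n
// matrix d_A to 1.0; the third argument is the total element count m*n.
deviceInitializeArray<double>(d_A, 1.0, m * n, m);
```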
SkyWalker
  • Have a look at [this answer](http://stackoverflow.com/a/21340927/681865) if you want to understand why the loop is a good idea. – talonmies Jan 25 '14 at 14:56