Understanding the use of memset in CUDA device code

Question

I have a linear int array arr, which is on CUDA global memory. I want to set sub-arrays of arr to defined values. The sub-array start indexes are given by the starts array, while the length of each sub-array is given in counts array.

What I want to do is to set the value of sub-array i starting from starts[i] and continuing upto counts[i] to the value starts[i]. That is, the operation is:

arr[starts[i]: starts[i]+counts[i]] = starts[i]

I thought of using memset() in the kernel for setting the values. However, it is not getting correctly written ( the array elements are being assigned some random values). The code I am using is:

#include <stdlib.h>
__global__ void kern(int* starts,int* counts, int* arr,int* numels)
{
    unsigned int idx = threadIdx.x + blockIdx.x*blockDim.x;

    if (idx>=numels[0])
        return;

    const int val = starts[idx];
    memset(&arr[val], val, sizeof(arr[0])*counts[idx]) ;
    __syncthreads();
}

Please note that numels[0] contains the number of elements in starts array.

I have checked the code with cuda-memcheck() but didn't get any errors. I am using PyCUDA, if it's relevant. I am probably misunderstanding the usage of memset here, as I am learning CUDA.

Can you please suggest a way to correct this? Or other efficient way of doint this operation.

P.S: I know that thrust::fill() can probably do this well, but since I am learning CUDA, I would like to know how to do this without using external libraries.

Is there any other efficient way to do the same operation on CUDA? — Geralt, Jul 29 '18 at 10:04

score 1 · Accepted Answer · answered Jul 29 '18 at 10:28

The memset and memcpy implementations in CUDA device code emit simple, serial, byte values operations (and note that memset can't set anything other than byte values, which might be contributing to the problem you see if the values you are trying to set are larger than 8 bits).

You could replace the memset call with something like this:

const int val = starts[idx];
//memset(&arr[val], val, sizeof(arr[0])*counts[idx]) ;
for(int i = 0; i < counts[idx]; i++)
    arr[val + i] = val;

The performance of that code will probably be better than the built-in memset.

Note also that the __syncthreads() call at the end of your kernel is both unnecessary, and a potential source of deadlock and should be removed. See here for more information.

Understanding the use of memset in CUDA device code

1 Answers1