CUDA: atomic operation on shared memory

Question

My cuda kernel generates something that is fed to host in the end of block execution.
The skeleton is as follows.
host_data where data is written to is allocated as host mapped memory.
host_data_count is also mapped memory which indicates the number of data produced.
The GPU I'm using is GTX 580 with Fermi architecture and CC 2.0.

__global__ void kernel(host_data, host_data_count)
{
    __shared__ int  shd_data[1024];
    __shared__ int  shd_cnt;
    int i;

    if (threadIdx.x == 0)
        shd_cnt = 0;
    __syncthreads();

    while ( ... )
    {
        if (something happens)
        {
            i = atomicAdd(&shd_cnt, 1);
            shd_data[i] = d;
        }
    }

    __syncthreads();
    if (threadIdx.x == 0)
    {
        i = atomicAdd(host_data_count, shd_cnt);
        memcpy(&host_data[i], shd_data, shd_cnt * 4);
    }
}

What am I missing in this kernel code?
Can anybody help?

Could you edit your question to explain what your problem is, because it is very unclear. Also, if you want help fixing a particular piece of code, please post *actual code* which demonstrates the problem and which someone else could compile and run. Debugging pseudocode is very difficult. — talonmies, Feb 03 '12 at 11:42
Could you also confirm that you are compiling your code with the option -arch=sm_20? — pQB, Feb 03 '12 at 12:46

score 1 · Answer 1 · answered Feb 05 '12 at 13:31

It's hard to tell what you are missing, because you didn't actually state what is the problem you are facing. I see few possibilities, but this depends on actual implementation of some of your higher concepts.

I assume a host_data_count is of type int* (or similar?). It either points to global memory, or to host memory through mapped-pinned memory. I would strongly suggest to actually use global memory for the sake of speed.
If host_data_count is a pinned memory, keep in mind, the atomic operations are atomic only within GPU. If, in the meantime, CPU does something with it, it may break the atomicy. You will, most likely, need to synchronize the host thread after the kernel call and before reading/using the value. Kernel calls are always asynchronous.
What is memcpy in device code? I assume you implemented it yourself, right? Are you copying memory using a single thread, or a whole block? Using whole block will be faster, but then you need to use this function outside the if (threadIdx.x==0), and the variable i must be shared.

CUDA: atomic operation on shared memory

1 Answers1