My CUDA kernel generates data that is passed to the host at the end of each block's execution.
host_data, where the data is written, is allocated as host-mapped memory.
host_data_count is also host-mapped memory and holds the number of items produced.
The GPU I'm using is a GTX 580 (Fermi architecture, CC 2.0).
The kernel skeleton is as follows.
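    __global__ void kernel(int *host_data, int *host_data_count)
    {
        __shared__ int shd_data[1024];  // per-block staging buffer
        __shared__ int shd_cnt;         // number of items staged by this block
        int i;

        // One thread resets the per-block counter.
        if (threadIdx.x == 0)
            shd_cnt = 0;
        __syncthreads();

        while ( ... )
        {
            if (something happens)
            {
                // Reserve a slot in shared memory and store the generated value d.
                i = atomicAdd(&shd_cnt, 1);
                shd_data[i] = d;
            }
        }
        __syncthreads();

        // One thread reserves a range in the host buffer and copies the block's data out.
        if (threadIdx.x == 0)
        {
            i = atomicAdd(host_data_count, shd_cnt);
            memcpy(&host_data[i], shd_data, shd_cnt * sizeof(int));
        }
    }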
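The host-mapped allocation described above can be sketched as follows. This is a minimal sketch under assumptions, not the actual host code: fill_demo is a hypothetical stand-in for the real kernel, and run_mapped_demo, the buffer capacity, and the launch configuration are made up for illustration. It only shows how the two buffers could be allocated as mapped memory and handed to a kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: a single thread writes three
// values and the item count directly into the mapped host buffers.
__global__ void fill_demo(int *host_data, int *host_data_count)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int k = 0; k < 3; ++k)
            host_data[k] = 10 * k;
        *host_data_count = 3;
    }
}

// Allocates both buffers as mapped pinned memory, runs the stand-in kernel,
// and returns the item count seen by the host (or -1 on any CUDA error).
int run_mapped_demo()
{
    const int capacity = 1024;            // assumed buffer size, in ints
    int *h_data = NULL, *h_count = NULL;  // host-side pointers
    int *d_data = NULL, *d_count = NULL;  // device-side aliases of the same memory
    int result = -1;

    // Request mapped-memory support; this has to happen before the CUDA
    // context is created on the device.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return -1;

    if (cudaHostAlloc((void **)&h_data, capacity * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) return -1;
    if (cudaHostAlloc((void **)&h_count, sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1;
    }
    *h_count = 0;

    // Get the device pointers that alias the host allocations.
    if (cudaHostGetDevicePointer((void **)&d_data, h_data, 0) == cudaSuccess &&
        cudaHostGetDevicePointer((void **)&d_count, h_count, 0) == cudaSuccess) {
        fill_demo<<<1, 32>>>(d_data, d_count);
        // The host must wait for the kernel before reading the buffers.
        if (cudaDeviceSynchronize() == cudaSuccess) {
            result = *h_count;
            for (int k = 0; k < result; ++k)
                printf("host_data[%d] = %d\n", k, h_data[k]);
        }
    }

    cudaFreeHost(h_data);
    cudaFreeHost(h_count);
    return result;
}
```

Calling run_mapped_demo() from main returns the number of items the stand-in kernel produced. The host reads the buffers only after cudaDeviceSynchronize(), since the kernel launch itself is asynchronous.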
What am I missing in this kernel code?
Can anybody help?