
My problem is the following: I have an image in which I detect some points of interest using the GPU. The detection test is heavyweight in terms of processing; however, only about 1 in 25 points passes it on average. The final stage of the algorithm is to build up a list of the points. On the CPU this would be implemented as:

for(int y = 0; y < h; y++)
{
    for(int x = 0; x < w; x++)
    {
        if(test_this_pixel(x,y))
            vector_of_coordinates.push_back(Vec2(x,y));
    }
}

On the GPU I have each CUDA block processing 16x16 pixels. The problem is that I need to do something special to eventually produce a single consolidated list of points in global memory. At the moment I am trying to generate a local list of points in shared memory per block, which will eventually be written to global memory. I am trying to avoid sending anything back to the CPU because there are more CUDA stages after this.

I was expecting that I could use atomic operations to implement the push_back function on shared memory. However, I am unable to get this working. There are two issues. The first annoying issue is that I constantly run into the following compiler crash:

nvcc error : 'ptxas' died with status 0xC0000005 (ACCESS_VIOLATION)

when using atomic operations. It is hit-or-miss whether a given kernel will compile. Does anyone know what causes this?

The following kernel will reproduce the error:

__global__ void gpu_kernel(int w, int h, RtmPoint *pPoints, int *pCounts)
{
    __shared__ unsigned int test;
    atomicInc(&test, 1000);
}

Secondly, my code, which includes a mutex lock on shared memory, hangs the GPU, and I don't understand why:

// spin until this thread swaps the mutex from 0 (unlocked) to 1 (locked)
__device__ void lock(unsigned int *pmutex)
{
    while(atomicCAS(pmutex, 0, 1) != 0);
}

// release the mutex
__device__ void unlock(unsigned int *pmutex)
{
    atomicExch(pmutex, 0);
}

__global__ void gpu_kernel_non_max_suppress(int w, int h, RtmPoint *pPoints, int *pCounts)
{
    __shared__ RtmPoint localPoints[64];
    __shared__ int localCount;
    __shared__ unsigned int mutex;

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    int threadid = threadIdx.y * blockDim.x + threadIdx.x;
    int blockid = blockIdx.y * gridDim.x + blockIdx.x;

    if(threadid==0)
    {
        localCount = 0;
        mutex = 0;
    }

    __syncthreads();

    if(x<w && y<h)
    {
        if(some_test_on_pixel(x,y))
        {
            RtmPoint point;
            point.x = x;
            point.y = y;

            // this is a local push_back operation
            lock(&mutex);
            if(localCount<64) // we should never get >64 points per block
                localPoints[localCount++] = point;
            unlock(&mutex);
        }
    }

    __syncthreads();

    if(threadid==0)
        pCounts[blockid] = localCount;
    if(threadid<localCount)
        pPoints[blockid * 64 + threadid] = localPoints[threadid];
}

In the example code at this site, the author manages to successfully use atomic operations on shared memory, so I am confused as to why my case does not work. If I comment out the lock and unlock lines, the code runs, but it obviously adds to the list incorrectly.

I would appreciate some advice about why this is happening, and also about whether there is a better way to achieve the goal, since I am concerned anyway about the performance cost of atomic operations and mutex locks.


2 Answers


I suggest using a prefix sum to implement that part, to increase parallelism. To do that you need to use a shared array. Basically, a prefix sum turns an array such as (1,1,0,1) into its running sum, (1,2,2,3) inclusive or (0,1,2,2) exclusive, so each thread that emits a point gets a unique write index. The scan used here is inclusive, so an emitting thread writes at its scan value minus one.

__shared__ uint8_t vector[NUMTHREADS]; // NUMTHREADS = threads per block

....

bool emit  = (x<w && y<h);
     emit  = emit && some_test_on_pixel(x,y);
__syncthreads();
scan<uint8_t>(emit, vector); // inclusive prefix sum across the block
if (emit) {
     RtmPoint point;
     point.x = x;
     point.y = y;
     // the scan is inclusive, so this thread's slot is its scan value minus one;
     // assumes at most 64 points per block, as in the question
     pPoints[blockid * 64 + vector[TID] - 1] = point;
}

prefix-sum example:

template <typename T>
__device__ unsigned int scan(T mark, T *output)
{
// ping-pong between the caller's output array and a local buffer
#define GET_OUT (pout ? output : values)
#define GET_INP (pin ? output : values)
  // numWorkers must be defined as the number of threads per block
  __shared__ T values[numWorkers];
  int pout = 0, pin = 1;
  int tid = threadIdx.x;

  values[tid] = mark;

  __syncthreads();

  // Hillis-Steele inclusive scan: each pass adds in the value
  // `offset` positions to the left
  for( int offset = 1; offset < numWorkers; offset *= 2) {
    pout = 1 - pout;
    pin = 1 - pout;
    __syncthreads();
    if ( tid >= offset) {
      GET_OUT[tid] = GET_INP[tid - offset] + GET_INP[tid];
    }
    else {
      GET_OUT[tid] = GET_INP[tid];
    }
    __syncthreads();
  }

  // make sure the final pass's result ends up in `output`
  if(!pout)
    output[tid] = values[tid];

  __syncthreads();

  // the last element of an inclusive scan is the block total
  return output[numWorkers - 1];

#undef GET_OUT
#undef GET_INP
}
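
As a sanity check, here is a minimal test harness for the scan (my own sketch, not part of the original answer). Every thread contributes 1, so thread i's inclusive result should be i+1 and the returned total should equal the block size; it assumes numWorkers is defined as 256 and that the scan template above is in the same file:

#include <cstdio>

#define numWorkers 256

// ... scan<T>() as defined above ...

__global__ void scan_test(unsigned int *out, unsigned int *total)
{
    __shared__ unsigned int result[numWorkers];
    unsigned int t = scan(1u, result); // every thread emits one item
    out[threadIdx.x] = result[threadIdx.x];
    if(threadIdx.x == 0)
        *total = t; // should equal numWorkers
}

int main()
{
    unsigned int *d_out, *d_total, h_out[numWorkers], h_total;
    cudaMalloc(&d_out, numWorkers * sizeof(unsigned int));
    cudaMalloc(&d_total, sizeof(unsigned int));
    scan_test<<<1, numWorkers>>>(d_out, d_total);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_total, d_total, sizeof(h_total), cudaMemcpyDeviceToHost);
    printf("last = %u, total = %u\n", h_out[numWorkers - 1], h_total); // both print 256
    cudaFree(d_out);
    cudaFree(d_total);
    return 0;
}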
perreal
  • That's quite interesting. Thank you. – Robotbugs Feb 28 '12 at 20:03
  • I just tried implementing this and one thing I found is that the scan function is incorrect at the line: "temp[pout*n+thid] += temp[pin*n+thid - offset];". This should actually be "temp[pout*n+thid] = temp[pin*n+thid] + temp[pin*n+thid - offset];" – Robotbugs Feb 28 '12 at 23:09
  • OK I implemented basically what you have, I'll post the final code later. Thanks a lot. – Robotbugs Feb 28 '12 at 23:41
  • You can find more efficient scan code in the source for the [CUDPP](http://cudpp.googlecode.com) library. BTW, to do it with shared atomics (which are slow, so you shouldn't), you should just be able to use atomicInc to get the current index for each thread to address the shared array. If atomicInc is causing ptxas to crash, that's a bug, and we'd like to hear about it -- please post the issue on the NVIDIA GPU Computing forums. Generally, though, I would recommend finding a higher-level way to implement this, like using thrust::copy_if with a thrust::transform_iterator. – harrism Feb 29 '12 at 01:29 (a sketch of this approach follows the comments below)
  • @harrism, can you write a pseudo kind of code to show how to use CUDPP for this example? – perreal Feb 29 '12 at 15:50
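
For completeness, here is a rough sketch of the Thrust approach harrism suggests in the comment above. This is my own illustration, not harrism's code: it uses a counting_iterator over linear pixel indices with a predicate rather than a transform_iterator, which amounts to the same compaction, and passes_test, the threshold, and find_points are placeholder assumptions standing in for the question's some_test_on_pixel:

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Hypothetical predicate: true where the interest-point test passes.
struct passes_test
{
    const unsigned char *img;
    int w;

    __host__ __device__
    bool operator()(int idx) const
    {
        int x = idx % w, y = idx / w;
        return img[y * w + x] > 128; // stand-in for some_test_on_pixel(x,y)
    }
};

// Compact the linear indices of all passing pixels into `points`
// and return how many were found; everything stays on the GPU.
int find_points(const unsigned char *d_img, int w, int h,
                thrust::device_vector<int> &points)
{
    points.resize(w * h); // worst case: every pixel passes
    passes_test pred = { d_img, w };
    thrust::counting_iterator<int> first(0), last(w * h);
    int count = thrust::copy_if(first, last, points.begin(), pred) - points.begin();
    return count;
}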

Based on the recommendations here, I include the code that I used in the end. It uses 16x16-pixel blocks. Note that I now write all the data out in one global array without breaking it up per block. I used the global atomicAdd function to compute a base address for each block's set of results. Since this is only called once per block, I did not find too much of a slowdown, and I gained a lot of convenience by doing it this way. I am also avoiding separate shared buffers for the input and output of prefixsum. pGlobalCount is set to zero prior to the kernel call.

#define BLOCK_THREADS 256

// Exclusive prefix sum over one block (Hillis-Steele, ping-pong buffers):
// returns, for each thread, the sum of `data` over all lower-numbered threads.
__device__ int prefixsum(int threadid, int data)
{
    __shared__ int temp[BLOCK_THREADS*2];

    int pout = 0;
    int pin = 1;

    // shift the input right by one to make the scan exclusive:
    // temp[0] = 0 and temp[i] holds the data of thread i-1
    if(threadid==BLOCK_THREADS-1)
        temp[0] = 0;
    else
        temp[threadid+1] = data;

    __syncthreads();

    for(int offset = 1; offset<BLOCK_THREADS; offset<<=1)
    {
        // swap the ping-pong buffers
        pout = 1 - pout;
        pin = 1 - pin;

        if(threadid >= offset)
            temp[pout * BLOCK_THREADS + threadid] = temp[pin * BLOCK_THREADS + threadid] + temp[pin * BLOCK_THREADS + threadid - offset];
        else
            temp[pout * BLOCK_THREADS + threadid] = temp[pin * BLOCK_THREADS + threadid];

        __syncthreads();
    }

    return temp[pout * BLOCK_THREADS + threadid];
}

__global__ void gpu_kernel(int w, int h, RtmPoint *pPoints, int *pGlobalCount)
{
    __shared__ int write_base;

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    int threadid = threadIdx.y * blockDim.x + threadIdx.x;
    int valid = 0;

    if(x<w && y<h)
    {
        if(test_pixel(x,y))
        {
            valid = 1;
        }
    }

    // exclusive scan: index = number of valid points in lower-numbered threads
    int index = prefixsum(threadid, valid);

    // the last thread's exclusive-scan value plus its own flag is the block
    // total, so it reserves this block's slice of the global output array
    if(threadid==BLOCK_THREADS-1)
    {
        int total = index + valid;
        if(total>64)
            total = 64; // global output buffer is limited to 64 points per block
        write_base = atomicAdd(pGlobalCount, total); // get a location to write them out
    }

    __syncthreads(); // ensure write_base is valid for all threads

    if(valid)
    {
        RtmPoint point;
        point.x = x;
        point.y = y;
        if(index<64)
            pPoints[write_base + index] = point;
    }
}
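
For reference, the launch side might look something like this (my own sketch; detect_points is a hypothetical wrapper, and it assumes d_points is allocated for the worst case of 64 points per block):

// Hypothetical host-side wrapper around the kernel above.
void detect_points(int w, int h, RtmPoint *d_points, int *d_globalCount)
{
    // zero the global counter before the kernel runs, as noted above
    cudaMemset(d_globalCount, 0, sizeof(int));

    // 16x16 = 256 threads per block, matching BLOCK_THREADS
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    gpu_kernel<<<grid, block>>>(w, h, d_points, d_globalCount);
}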
Robotbugs
  • The only issue with using atomicAdd to coordinate the writing of the results is that they end up in a random order that changes from run to run. However, this does not matter much, and it's easy to sort the output vector. – Robotbugs Feb 29 '12 at 23:34
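
A minimal sketch of the sort mentioned in the comment above, using Thrust (my own illustration, not part of the original answer; it assumes RtmPoint has int x and y members and that raster order, by row then column, is the order wanted):

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Hypothetical comparator: order points by row, then by column.
struct point_less
{
    __host__ __device__
    bool operator()(const RtmPoint &a, const RtmPoint &b) const
    {
        return (a.y != b.y) ? (a.y < b.y) : (a.x < b.x);
    }
};

// Sort the n points produced by the kernel, in place on the device.
void sort_points(RtmPoint *d_points, int n)
{
    thrust::sort(thrust::device, d_points, d_points + n, point_less());
}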