
I have a kernel with a "while" loop that iteratively updates elements of an array using information about their neighbors (only one neighbor in the sample code below). The loop stops when no element has changed during the current iteration.

Unfortunately, in some situations some of the threads exit this loop prematurely (as if they ignored the synchronization barrier). Some inputs are processed correctly every time, and other inputs (many of them) are processed incorrectly every time (i.e. there are no stochastic factors). Strangely, the error occurs only in the Release build, while the Debug build always works fine. More precisely, the CUDA compiler option "-G (Generate GPU Debug Information)" determines whether the processing is correct. Arrays of size 32x32 or smaller are always processed correctly.

Here is a sample code:

__global__ void kernel(int *source, int size, unsigned char *result, unsigned char *alpha)
{
    int x = threadIdx.x, y0 = threadIdx.y * 4;
    int i, y;
    __shared__ bool alpha_changed;

    // Zero intermediate array using margins for safe access to neighbors
    const int stride = MAX_SIZE + 2;
    for (i = threadIdx.x + threadIdx.y * blockDim.x; i < stride * (stride + 3); i += blockDim.x * blockDim.y)
    {
        alpha[i] = 0;
    }
    __syncthreads();

    for (int bit = MAX_BITS - 1; bit >= 0; bit--)
    {
        __syncthreads();

        // Fill intermediate array with bit values from input array
        alpha_changed = true;
        alpha[(x + 1) + (y0 + 1) * stride] = (source[x + (y0 + 0) * size] & (1 << bit)) != 0;
        alpha[(x + 1) + (y0 + 2) * stride] = (source[x + (y0 + 1) * size] & (1 << bit)) != 0;
        alpha[(x + 1) + (y0 + 3) * stride] = (source[x + (y0 + 2) * size] & (1 << bit)) != 0;
        alpha[(x + 1) + (y0 + 4) * stride] = (source[x + (y0 + 3) * size] & (1 << bit)) != 0;
        __syncthreads();

        // The loop in question
        while (alpha_changed)
        {
            alpha_changed = false;
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 1) * stride] != 0 && alpha[(x + 1) + (y0 + 1) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 1) * stride] = 1;
            }
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 2) * stride] != 0 && alpha[(x + 1) + (y0 + 2) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 2) * stride] = 1;
            }
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 3) * stride] != 0 && alpha[(x + 1) + (y0 + 3) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 3) * stride] = 1;
            }
            __syncthreads();
            if (alpha[(x + 0) + (y0 + 4) * stride] != 0 && alpha[(x + 1) + (y0 + 4) * stride] == 0)
            {
                alpha_changed = true;
                alpha[(x + 1) + (y0 + 4) * stride] = 1;
            }
            __syncthreads();
        }
        __syncthreads();

        // Save result
        result[x + (y0 + 0) * size + bit * size * size] = alpha[(x + 1) + (y0 + 1) * stride];
        result[x + (y0 + 1) * size + bit * size * size] = alpha[(x + 1) + (y0 + 2) * stride];
        result[x + (y0 + 2) * size + bit * size * size] = alpha[(x + 1) + (y0 + 3) * stride];
        result[x + (y0 + 3) * size + bit * size * size] = alpha[(x + 1) + (y0 + 4) * stride];
        __syncthreads();
    }
}

// Run only 1 thread block, where size equals 64.
kernel <<< 1, dim3(size, size / 4) >>> (source_gpu, size, result_gpu, alpha_gpu);

The expected result of this sample kernel is an array in which each line contains only contiguous intervals of "1" values. Instead, I get some lines where "0" and "1" values alternate.

This error reproduces on my mobile GPU, a GeForce 740M (Kepler), on Windows 7 x64 SP1, with either CUDA 6.0 or 6.5, using either Visual C++ 2012 or 2013. I can also provide a sample Visual Studio project with a sample input array that is processed incorrectly.

I have already tried different configurations of __syncthreads(), memory fences, and the "volatile" qualifier, but the error remains.

Any help is appreciated.

Triant

1 Answer


I think the problem is your access to alpha_changed. Keep in mind there is only one copy of this variable shared by all threads in the block. There is a race condition between one warp resetting this variable and another warp checking the loop condition:

    // The loop in question
    while (alpha_changed)
    {
        alpha_changed = false;
        // ...
        // alpha_changed may be set to true here
        // ...

        __syncthreads();

        // race condition window here. Another warp may already execute
        // the alpha_changed = false; line before this warp continues.
    }

The key thing is doing a __syncthreads() before setting the shared variable to false.
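For clarity, here is a minimal sketch of that reordering, using the same shared flag as in the question (the neighbor-update body is elided; this is illustrative, not tested):

```cuda
__shared__ bool alpha_changed;

while (alpha_changed)
{
    __syncthreads();        // every thread has already tested the condition
    alpha_changed = false;  // now it is safe for any thread to clear the flag
    __syncthreads();        // the clear is visible before updates begin

    // ... neighbor updates; a thread sets alpha_changed = true
    //     whenever it modifies an element ...

    __syncthreads();        // updates are visible before the next condition test
}
```

The point is that the barrier between testing the condition and clearing the flag closes the window described above.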

You can use a local variable inside the loop to track whether that thread made any change. This avoids having to use __syncthreads() all over the place. Then do a reduction at the end of the loop:

    // The loop in question
    while (alpha_changed)
    {
        bool alpha_changed_here = false;
        // ...
        // alpha_changed_here may be set to true here
        // ...

        __syncthreads();
        alpha_changed = false;
        __syncthreads();
        // I think you can get away with a simple if-statement here
        // instead of a proper reduction
        if (alpha_changed_here) alpha_changed = true;
        __syncthreads();
    }

As far as I know, this method of using just one variable in shared memory currently works. If you want to be sure, use a proper reduction algorithm. You can use __any() to reduce 32 values in one instruction within a warp. Which algorithm to use depends on the size of your blocks (I don't know the exact behavior if the size is not a multiple of 32).
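As a sketch of the block-level alternative (on compute capability 2.0 and later), __syncthreads_or() combines the barrier and the reduction in one call; changed_here is a hypothetical per-thread flag standing in for the update logic:

```cuda
bool done = false;
while (!done)
{
    int changed_here = 0;

    // ... neighbor updates; a thread sets changed_here = 1
    //     whenever it modifies an element ...

    // __syncthreads_or() synchronizes the block and returns nonzero
    // if any thread in the block passed a nonzero predicate.
    done = (__syncthreads_or(changed_here) == 0);
}
```

This removes the shared flag entirely, so there is no write/read race on it to reason about.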

roeland
  • I agree that `alpha_changed` is the source of the problem, but instead of using shared memory, I would suggest using a warp-level intrinsic such as `__any()` for loop control. – ArchaeaSoftware Mar 19 '15 at 14:43
  • @ArchaeaSoftware yes, that's what I mean by doing a proper reduction. I can add it to the answer. Note that `__any()` is not block-level, so you would need a small reduction afterwards. But I think as long as all threads writing to the shared memory write the same value, the result is well-defined, so the simple approach will work. – roeland Mar 19 '15 at 20:59
  • Thank you for such a quick and useful response. "Doing a __syncthreads() before setting the shared variable to false" solved the problem. It appeared that only two `__syncthreads()` are necessary (at the start of the loop and at the end). The idea of using __any() also has some potential for my real kernel, but it will increase the complexity and can possibly decrease performance due to insufficient size of shared memory and number of registers. – Triant Mar 19 '15 at 22:49
  • If you need block-level primitives, use `__syncthreads_count()` or `__syncthreads_or()`. – ArchaeaSoftware Mar 20 '15 at 18:46
  • Thank you, I will consider them during kernel optimization. – Triant Mar 20 '15 at 23:28