CUDA shared memory and warp synchronization

Question

Following host code test.c and device code test0.cu are intended to give the same result.

test.c

$ cat test.c
#include <stdio.h>
#include <string.h>

int main()
{
        int data[32];
        int dummy[32];

        for (int i = 0; i < 32; i++)
                data[i] = i;

        memcpy(dummy, data, sizeof(data));
        for (int i = 1; i < 32; i++)
                data[i] += dummy[i - 1];
        memcpy(dummy, data, sizeof(data));
        for (int i = 2; i < 32; i++)
                data[i] += dummy[i - 2];
        memcpy(dummy, data, sizeof(data));
        for (int i = 4; i < 32; i++)
                data[i] += dummy[i - 4];
        memcpy(dummy, data, sizeof(data));
        for (int i = 8; i < 32; i++)
                data[i] += dummy[i - 8];
        memcpy(dummy, data, sizeof(data));
        for (int i = 16; i < 32; i++)
                data[i] += dummy[i - 16];

        printf("kernel  : ");
        for (int i = 0; i < 32; i++)
                printf("%4i ", data[i]);
        printf("\n");
}
$

test0.cu

$ cat test0.cu
#include <stdio.h>

__global__ void kernel0(int *data)
{
        size_t t_id = threadIdx.x;

        if (1 <= t_id)
                data[t_id] += data[t_id - 1];
        if (2 <= t_id)
                data[t_id] += data[t_id - 2];
        if (4 <= t_id)
                data[t_id] += data[t_id - 4];
        if (8 <= t_id)
                data[t_id] += data[t_id - 8];
        if (16 <= t_id)
                data[t_id] += data[t_id - 16];
}

int main()
{
        int data[32];
        int result[32];

        int *data_d;
        cudaMalloc(&data_d, sizeof(data));

        for (int i = 0; i < 32; i++)
                data[i] = i;

        dim3 gridDim(1);
        dim3 blockDim(32);

        cudaMemcpy(data_d, data, sizeof(data), cudaMemcpyHostToDevice);
        kernel0<<<gridDim, blockDim>>>(data_d);
        cudaMemcpy(result, data_d, sizeof(data), cudaMemcpyDeviceToHost);

        printf("kernel0 : ");
        for (int i = 0; i < 32; i++)
                printf("%4i ", result[i]);
        printf("\n");
}
$

If I compile and run them, they do give the same result as I expected.

$ gcc -o test test.c
$ ./test
kernel  :    0    1    3    6   10   15   21   28   36   45   55   66   78   91  105  120  136  153  171  190  210  231  253  276  300  325  351  378  406  435  465  496
$ nvcc -o test_dev0 test0.cu
$ ./test_dev0
kernel0 :    0    1    3    6   10   15   21   28   36   45   55   66   78   91  105  120  136  153  171  190  210  231  253  276  300  325  351  378  406  435  465  496
$

However, if I use shared memory instead of global memory in the device code, as in test1.cu, it gives different result.

test1.cu

$ cat test1.cu
#include <stdio.h>

__global__ void kernel1(int *data)
{
        __shared__ int data_s[32];

        size_t t_id = threadIdx.x;

        data_s[t_id] = data[t_id];

        if (1 <= t_id)
                data_s[t_id] += data_s[t_id - 1];
        if (2 <= t_id)
                data_s[t_id] += data_s[t_id - 2];
        if (4 <= t_id)
                data_s[t_id] += data_s[t_id - 4];
        if (8 <= t_id)
                data_s[t_id] += data_s[t_id - 8];
        if (16 <= t_id)
                data_s[t_id] += data_s[t_id - 16];

        data[t_id] = data_s[t_id];
}

int main()
{
        int data[32];
        int result[32];

        int *data_d;
        cudaMalloc(&data_d, sizeof(data));

        for (int i = 0; i < 32; i++)
                data[i] = i;

        dim3 gridDim(1);
        dim3 blockDim(32);

        cudaMemcpy(data_d, data, sizeof(data), cudaMemcpyHostToDevice);
        kernel1<<<gridDim, blockDim>>>(data_d);
        cudaMemcpy(result, data_d, sizeof(data), cudaMemcpyDeviceToHost);

        printf("kernel1 : ");
        for (int i = 0; i < 32; i++)
                printf("%4i ", result[i]);
        printf("\n");
}
$

If I compile test1.cu and run it, it gives different result from test0.cu or test.c.

$ nvcc -o test_dev1 test1.cu
$ ./test_dev1
kernel1 :    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31
$

Is warp synchronization not supposed to work with shared memory?

Some investigation into this issue:

When using CUDA8.0, if I compile test1.cu with -arch=sm_61 option(I'm testing with GTX 1080), it gives same result as test0.cu and test.c.

$ nvcc -o test_dev1_arch -arch=sm_61 test1.cu
$ ./test_dev1_arch
kernel1 :    0    1    3    6   10   15   21   28   36   45   55   66   78   91  105  120  136  153  171  190  210  231  253  276  300  325  351  378  406  435  465  496
$

But this does not apply to newer versions of CUDA. If I use any newer version than 8.0, the test result is different even if I give the -arch=sm_61 option.

Michael Kenzel · Answer 1 · 2019-01-08T17:35:29.113

Your device code has undefined behavior due to race conditions in both cases, using shared memory or using global memory. You have multiple threads that concurrently read and modify the same int object.

Is warp synchronization not supposed to work with shared memory?

I don't see any warp synchronization in your code.

The fact that the hardware executes warps in lock step (which is not necessarily true to begin with) is completely irrelevant because it is not the hardware who reads your C++ code. It is whatever toolchain you use to translate your C++ code into the machine code that will actually run on your hardware. And C++ compilers are allowed to optimize based on the abstract rules of the C++ language.

Let's look at the machine code that's actually generated for your example (using CUDA 10 here on my machine):

_Z7kernel1Pi:
        /*0008*/                   MOV R1, c[0x0][0x20] ;
        /*0010*/                   S2R R9, SR_TID.X ;
        /*0018*/                   SHL R8, R9.reuse, 0x2 ;
        /*0028*/                   SHR.U32 R0, R9, 0x1e ;
        /*0030*/                   IADD R2.CC, R8, c[0x0][0x140] ;
        /*0038*/                   IADD.X R3, R0, c[0x0][0x144] ;
        /*0048*/                   LDG.E R0, [R2] ;
        /*0050*/                   ISETP.NE.AND P0, PT, R9.reuse, RZ, PT ;
        /*0058*/                   ISETP.GE.U32.AND P1, PT, R9, 0x2, PT ;
        /*0068*/               @P0 LDS.U.32 R5, [R8+-0x4] ;
        /*0070*/         {         ISETP.GE.U32.AND P2, PT, R9.reuse, 0x4, PT ;
        /*0078*/               @P1 LDS.U.32 R6, [R8+-0x8]         }
        /*0088*/                   ISETP.GE.U32.AND P3, PT, R9, 0x8, PT ;
        /*0090*/               @P2 LDS.U.32 R7, [R8+-0x10] ;
        /*0098*/         {         ISETP.GE.U32.AND P4, PT, R9, 0x10, PT   SLOT 0;
        /*00a8*/               @P3 LDS.U.32 R9, [R8+-0x20]   SLOT 1        }
        /*00b0*/               @P4 LDS.U.32 R10, [R8+-0x40] ;
        /*00b8*/         {         MOV R4, R0 ;
        /*00c8*/                   STS [R8], R0         }
        /*00d0*/               @P0 IADD R5, R4, R5 ;
        /*00d8*/         {     @P0 MOV R4, R5 ;
        /*00e8*/               @P0 STS [R8], R5         }
        /*00f0*/               @P1 IADD R6, R4, R6 ;
        /*00f8*/         {     @P1 MOV R4, R6 ;
        /*0108*/               @P1 STS [R8], R6         }
        /*0110*/               @P2 IADD R7, R4, R7 ;
        /*0118*/         {     @P2 MOV R4, R7 ;
        /*0128*/               @P2 STS [R8], R7         }
        /*0130*/               @P3 IADD R9, R4, R9 ;
        /*0138*/         {     @P3 MOV R4, R9 ;
        /*0148*/               @P3 STS [R8], R9         }
        /*0150*/               @P4 IADD R10, R4, R10 ;
        /*0158*/               @P4 STS [R8], R10 ;
        /*0168*/               @P4 MOV R4, R10 ;
        /*0170*/                   STG.E [R2], R4 ;
        /*0178*/                   EXIT ;
.L_1:
        /*0188*/                   BRA `(.L_1) ;
.L_14:

As you can see, the compiler (in this particular case, the "culprit" was actually the PTX assembler) has translated your sequence of ifs into a bunch of instructions that set up predicates based on the if conditions. It first fetches all the values it's ever going to need from shared memory into registers using conditional loads. Only after that, it performs all the additions and conditional stores using the already loaded values. This is a perfectly legal interpretation of your C++ code. Since you did not specify any synchronization or memory ordering constraints, the compiler can operate under the assumption that there are no potentially concurrent conflicts, and all these loads and stores can be reordered in whatever way it sees fit.

To fix your code, use explicit warp synchronization:

__global__ void kernel1(int *data)
{
        __shared__ int data_s[32];

        size_t t_id = threadIdx.x;

        data_s[t_id] = data[t_id];

        __syncwarp();
        if (1 <= t_id)
                data_s[t_id] += data_s[t_id - 1];
        __syncwarp();
        if (2 <= t_id)
                data_s[t_id] += data_s[t_id - 2];
        __syncwarp();
        if (4 <= t_id)
                data_s[t_id] += data_s[t_id - 4];
        __syncwarp();
        if (8 <= t_id)
                data_s[t_id] += data_s[t_id - 8];
        __syncwarp();
        if (16 <= t_id)
                data_s[t_id] += data_s[t_id - 16];

        data[t_id] = data_s[t_id];
}

The reason why this problem only manifests starting with CUDA 9.0 is that warp-level synchronization was only really introduced in CUDA 9.0 when Volta and "independent thread scheduling" made it a necessity. Before CUDA 9.0, warp-synchronous programming was not officially supported. But compilers used to be rather conservative when it came to actually breaking code like in your example above. The reason probably being that such "warp-synchronous" programming (note the quotes) was often the only way to even get close to peak performance, there was no real alternative and, thus, people were doing it all the time. It still was undefined behavior, though, and NVIDIA kept warning us. It just happened to just work in many cases…

nglee · Answer 2 · 2019-01-15T16:28:03.943

It seems that the point I missed was to declare the shared memory with volatile qualifier. This fixed the issue. (Test code)

However, as stated in the answer by Michael Kenzel, this kind of implicit warp-synchronous programming should be generally avoided, even though this is introduced in the classic parallel reduction(on page 22) provided by NVIDIA itself.

Since future compiler and memory hardware might work differently, it would be dangerous to rely on it. Using __syncwarp() similar to the solution provided by Michael Kenzel should be a better solution. With help from this NVIDIA dev blog article, the safe solution would be:

__global__ void kernel(int *data)
{
    __shared__ int data_s[32];

    size_t t_id = threadIdx.x;

    data_s[t_id] = data[t_id];

    int v = data_s[t_id];

    unsigned mask = 0xffffffff;     __syncwarp(mask);

    mask = __ballot_sync(0xffffffff, 1 <= t_id);
    if (1 <= t_id) {
        v += data_s[t_id - 1];  __syncwarp(mask);
        data_s[t_id] = v;       __syncwarp(mask);
    }
    mask = __ballot_sync(0xffffffff, 2 <= t_id);
    if (2 <= t_id) {
        v += data_s[t_id - 2];  __syncwarp(mask);
        data_s[t_id] = v;       __syncwarp(mask);
    }
    mask = __ballot_sync(0xffffffff, 4 <= t_id);
    if (4 <= t_id) {
        v += data_s[t_id - 4];  __syncwarp(mask);
        data_s[t_id] = v;       __syncwarp(mask);
    }
    mask = __ballot_sync(0xffffffff, 8 <= t_id);
    if (8 <= t_id) {
        v += data_s[t_id - 8];  __syncwarp(mask);
        data_s[t_id] = v;       __syncwarp(mask);
    }
    mask = __ballot_sync(0xffffffff, 16 <= t_id);
    if (16 <= t_id) {
        v += data_s[t_id - 16]; __syncwarp(mask);
        data_s[t_id] = v;
    }

    data[t_id] = data_s[t_id];
}

CUDA shared memory and warp synchronization

2 Answers2