#define dimG 16
#define dimB 64
#define eps 1e-6f   // assumed value; the real define was in the code I deleted from the post

// solve one step on the GPU
__global__ void SloveStepGPU(float* X, float* Y, int* iCons, int* jCons, int* dCons, float* wCons, int cnt, float c)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = id; i < cnt; i += dimG * dimB) {
        int I = iCons[i];
        int J = jCons[i];
        int d = dCons[i];
        float wc = wCons[i] * c;
        if (wc > 1.0f) wc = 1.0f;
        // read the current positions via atomicAdd(..., 0)
        float XI = atomicAdd(&(X[I]), 0);
        float XJ = atomicAdd(&(X[J]), 0);
        float YI = atomicAdd(&(Y[I]), 0);
        float YJ = atomicAdd(&(Y[J]), 0);
        float pqx = XI - XJ;
        float pqy = YI - YJ;
        float mag = sqrtf(pqx * pqx + pqy * pqy);
        float r = (d - mag) / 2.0f;
        float mx = wc * r * pqx / (mag + eps);
        float my = wc * r * pqy / (mag + eps);
        if (d == 1) {
            atomicAdd(&(X[I]), mx);
            atomicAdd(&(Y[I]), my);
        }
        atomicAdd(&(X[J]), -mx);
        atomicAdd(&(Y[J]), -my);
    }
}
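The host-side launch is part of the code I removed; it presumably uses dimG blocks of dimB threads, matching the grid-stride step of dimG*dimB inside the loop. Something like this (d_X, d_Y, d_iCons, d_jCons, d_dCons, d_wCons are the device pointers):

// Assumed launch configuration, not shown above
SloveStepGPU<<<dimG, dimB>>>(d_X, d_Y, d_iCons, d_jCons, d_dCons, d_wCons, cnt, c);
cudaDeviceSynchronize();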
In this code, I know that X and Y can have data races. My original assumption was that the reads of XI, XJ, YI, YJ might simply return stale values. However, I found that during the data race XI, XJ, YI, YJ can end up reading what look like random memory values, i.e. something like a memory access violation. Even if I add a lock around the reads and writes, I still get the same result. Only when I reduce dimB and dimG, so that there is almost no contention, do I get a correct result. Is there any solution?
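For illustration, here is a sketch of one workaround I could try (just a sketch, not the code I am currently using): a two-pass, Jacobi-style update that accumulates the displacements into separate scratch buffers DX and DY (assumed to be zeroed before each step) and applies them in a second kernel, so the positions are never read and written in the same pass.

// Sketch of a two-pass variant. DX/DY are hypothetical scratch arrays, one entry per
// vertex, zeroed before launch; eps is the same define as above.
__global__ void AccumulateStepGPU(const float* X, const float* Y,
                                  const int* iCons, const int* jCons, const int* dCons,
                                  const float* wCons, int cnt, float c,
                                  float* DX, float* DY)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = id; i < cnt; i += gridDim.x * blockDim.x) {
        int I = iCons[i];
        int J = jCons[i];
        int d = dCons[i];
        float wc = wCons[i] * c;
        if (wc > 1.0f) wc = 1.0f;
        float pqx = X[I] - X[J];                 // plain reads: X/Y are not written in this pass
        float pqy = Y[I] - Y[J];
        float mag = sqrtf(pqx * pqx + pqy * pqy);
        float r = (d - mag) / 2.0f;
        float mx = wc * r * pqx / (mag + eps);
        float my = wc * r * pqy / (mag + eps);
        if (d == 1) {
            atomicAdd(&DX[I], mx);               // races happen only on the accumulators,
            atomicAdd(&DY[I], my);               // which atomicAdd handles correctly
        }
        atomicAdd(&DX[J], -mx);
        atomicAdd(&DY[J], -my);
    }
}

// Second pass: apply the accumulated displacements (n = number of vertices).
__global__ void ApplyStepGPU(float* X, float* Y, const float* DX, const float* DY, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = id; i < n; i += gridDim.x * blockDim.x) {
        X[i] += DX[i];
        Y[i] += DY[i];
    }
}

I am not sure whether changing the update scheme like this is acceptable for my solver, so I would still like to understand the original behavior.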
I compile for 64-bit under Windows with VS2015 and CUDA 9.1.
However, the same code runs under Linux without any problem.
There is also no problem when running under the Nsight CUDA debugger on Windows; the reason is probably that execution under the debugger is slow enough that the data race does not occur.
----- update: deleted the other code -----