CUDA C/C++: Same executable gives different results from the first run

Question

Take my previous answered question: My Previous Question, which, by the way, was properly answered by Robert Crovella.

I came up with another kernel that computes a random step to a point (by using the same RNG from my previous question) and calculates the difference of energy of that point with respect to its previous position (coordinates). This is the kernel:

__global__
void DeltaE(float *X, float *Y,float *Xn, float *Yn, float *step,float *DELTA, curandState *state, const int N,const int n){

    int tIdx = blockIdx.x*blockDim.x + threadIdx.x;
    int bIdx = blockIdx.x;
    //int sIdx = blockIdx.x*blockDim.x;

    float x , y;
    float rdmn1, rdmn2;

    float dIntE = 0.0e0f, dConfE = 0.0e0f, dTotE = 0.0e0f;
    if(tIdx < N){

        if(tIdx == n){
            step[tIdx] = 0.2;
            rdmn1 = curand_uniform(&state[tIdx]);
            rdmn2 = curand_uniform(&state[tIdx]);

            Xn[tIdx] = X[tIdx] + step[tIdx]*(2.0e0f*rdmn1 - 1.0e0f);
            Yn[tIdx] = Y[tIdx] + step[tIdx]*(2.0e0f*rdmn2 - 1.0e0f);
            dConfE = - (X[tIdx]*X[tIdx] + Y[tIdx]*Y[tIdx]);
            dConfE += Xn[tIdx]*Xn[tIdx] + Yn[tIdx]*Yn[tIdx];

        }
        else{
            x = X[tIdx] - X[n];
            y = Y[tIdx] - Y[n];

            dIntE += -1.0e0f/sqrt(x*x + y*y);
        }
        __syncthreads();
        if(tIdx != n){
            x = X[tIdx] - Xn[n];
            y = Y[tIdx] - Yn[n];

            dIntE += 1.0e0f/sqrt(x*x + y*y);
        }       
        dTotE = dConfE + dIntE;
        dTotE = ReduceSum2(dTotE);
        if(threadIdx.x == 0)DELTA[bIdx] = dTotE;


    }
}

Then I do the final sum on the CPU:

cudaMemcpy(&delta[0],&d_delta[0],blocks.x*sizeof(float),cudaMemcpyDeviceToHost);
float dE = 0;
for(int i = 0; i < blocks.x; i++){
    dE += delta[i];
}

My kernel is launched with the following configuration:

dim3 threads(BlockSize,BlockSize);
dim3 blocks(ceil(Np/threads.x),ceil(Np/threads.y));
DeltaE<<<blocks.x,threads.x,threads.x*sizeof(float)>>>(d_rx,d_ry,d_rxn,d_ryn,d_step,d_delta,d_state,Np,nn);

Where Np is the number of points (I used 1k - 4k). I have a GeForce 9500 GT, which has no support to double. And I compile using no flag/no option.

Take Np = 1k, for example. When I compile and then run, the result is dE = 6.557993. When I run a second, third, fourth, whatever time, it is dE = -0.3515406. Does anyone know where this come from?

P.S.: I forgot to mention, the same kernel AvgDistance that can be found at My Previous Question is called right before DeltaE. I don't know if this has anything to do, but I thought it was worth to mention.

P.S.2: nn is any chosen point(particle).

Your cudamemcpy is copying `d_delta` back to the host variable `delta`. Then your loop is summing `dE += energy[i];` What is the connection between `delta` and `energy` on the host? — Robert Crovella, Mar 15 '13 at 22:13
oops, sorry...it is just a typing mistake...I'll correct it! — Bessa, Mar 15 '13 at 23:52
OK, I extended the sample app that I created for the previous question with the new code, and it seems to run the same every time. When you say "the first time" what do you mean? The first time after you compile it? The first time after you reboot the machine? If it's compiling, do you mean that every time you recompile it it gives different behavior? — Robert Crovella, Mar 15 '13 at 23:56
I mean that after the first time I run it, right after compile the code, it gives a result X. If I run it right after that the result is Y. And if I run a third time, right after the second, the result is again Y. I didn't reboot the machine neither waited more than a few seconds to run repeatedly many times the application to check weather it will return the same result Y. I hope it is clear now. — Bessa, Mar 16 '13 at 00:23
Not really. The first time you got result A. The second, third, fourth, fifth times you get result B. Are you ever able to get result A again? What do you have to do to get result A again? — Robert Crovella, Mar 16 '13 at 00:32
I don't ever get A again. I know, it is really weird, but is happening. Bottom line, for a given Np, right after I compile the code, I obtain A if a run it once for the first time, and then I get B for every other run. And they are quiet different result. For instance, for a seed = 337459, I'm getting dE = -64.62045 at the first time I run the app. If I run it again, I get dE = 1.919965. And then again dE = 1.919965. — Bessa, Mar 16 '13 at 00:42
I think the problem is that you are depending on a grid-wide synchronization at your `__synchthreads()` call, but that barrier only synchronizes the threads within a block. After that barrier, your code uses `Xn[n]` and `Yn[n]` to compute `x` and `y`, but `Xn[n]` and `Yn[n]` may not be computed yet, depending on the order in which the threadblocks execute. You could try breaking it up into 2 kernels, the first of which includes everything up to the `__syncthreads()` call, and the second of which includes the remainder, to create a grid-wide sync point. — Robert Crovella, Mar 16 '13 at 00:52
Hum... I didn't know `__syncthreads()` was barrier just for threads within a block. Later I'll try what try what you told and let you know. — Bessa, Mar 16 '13 at 01:02

score 2 · Accepted Answer · edited May 23 '17 at 11:50

As was pointed out by Robert Crovella via comment above, what probably was happening is that while tIdx = n was computing Xn[n] and Yn[n], other threads was using this value and it may not be computed yet. In that case, the only reason to the other runs (other then the first one) to get ways the same (correct) value is that the memory pointed to by Xn and Yn is already occupied with the right value, and even with that synchronization problem the application returns the right value.

In any case, I avoided the synchronization problem by splitting the kernel into two, just as I was advised by Robert Crovella via comment:

__global__
void DeltaE1(float *X, float *Y,float *Xn, float *Yn, float *step,float *DELTA, curandState *state, const int N,const int n){

    int tIdx = blockIdx.x*blockDim.x + threadIdx.x;
    float x , y;
    float rdmn1, rdmn2;

    if(tIdx < N){
        DELTA[tIdx] = 0.0e0f;
        if(tIdx == n){
            step[tIdx] = 0.2e0f;
            rdmn1 = curand_uniform(&state[tIdx]);
            rdmn2 = curand_uniform(&state[tIdx]);

            Xn[tIdx] = X[tIdx] + step[tIdx]*(2.0e0f*rdmn1 - 1.0e0f);
            Yn[tIdx] = Y[tIdx] + step[tIdx]*(2.0e0f*rdmn2 - 1.0e0f);
            DELTA[tIdx] = - (X[tIdx]*X[tIdx] + Y[tIdx]*Y[tIdx]);
            DELTA[tIdx] += Xn[tIdx]*Xn[tIdx] + Yn[tIdx]*Yn[tIdx];

        }
        else{
            x = X[tIdx] - X[n];
            y = Y[tIdx] - Y[n];

            DELTA[tIdx] += -1.0e0f/sqrt(x*x + y*y);
        }           
    }
}

__global__
void DeltaE2(float *X, float *Y,float *Xn, float *Yn,float *DELTA,const int N,const int n){

    int tIdx = blockIdx.x*blockDim.x + threadIdx.x;
    int bIdx = blockIdx.x;

    float x , y;
    float dTotE = 0.0e0f;
    if(tIdx < N){
        if(tIdx != n){
            x = X[tIdx] - Xn[n];
            y = Y[tIdx] - Yn[n];

            DELTA[tIdx] += 1.0e0f/sqrt(x*x + y*y);

        }
        dTotE = DELTA[tIdx];
        dTotE = ReduceSum2(dTotE);
        if(threadIdx.x == 0)DELTA[bIdx] = dTotE;

    }

}

CUDA C/C++: Same executable gives different results from the first run

1 Answers1