
I want to write a simple tiled convolution code. It is from the Coursera course Heterogeneous Parallel Programming: the lecture provides a simple convolution code using the tiling method, but the code was not complete, so I filled in the blanks. The result is below.

The goal of this code is to calculate a convolution. Input dimension: 24 by 24; kernel size: 9 by 9; output dimension: 16 by 16.

Also, in main, I added code to measure the computation time so I can compare it with a CPU version.

The problem is that whenever I run this code, the result is different. I have been trying to find the problem for several days, but nothing has worked. I also found similar code on an Internet blog, but it has the same problem as mine. I do not know why the result differs every run. Some say it is due to a race condition, but I cannot find anything like that.

Here are example results of the convolution (16 by 16) from two different runs:

81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 0 0 0 0 0 0
81 81 81 81 81 81 81 81 81 81 0 0 0 0 0 0
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
0 0 0 0 0 0 0 0 0 0 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81

81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81
81 81 81 81 81 81 81 81 81 81 0 0 0 0 0 0
81 81 81 81 81 81 81 81 81 81 0 0 0 0 0 0
81 81 81 81 81 81 81 81 81 81 0 0 0 0 0 0
81 81 81 81 81 81 81 81 81 81 0 0 0 0 0 0
81 81 81 81 81 81 81 81 81 81 0 0 0 0 0 0
81 81 81 81 81 81 81 81 81 81 0 0 0 0 0 0

My device is a GeForce GT 630 with CUDA driver and runtime version 7.5, on Ubuntu 14.04.

Thanks in advance!

#include<stdio.h>
#include<cuda.h>
#include<time.h>

#define O_TILE_WIDTH 10
#define MASK_WIDTH 9
#define I_TILE_WIDTH (O_TILE_WIDTH+MASK_WIDTH-1)

__global__ void Convolution2DBasicKernel(float *out, float *in, int in_height, int in_width, const float *__restrict__ mask, int output_dim)
{

    int tx=threadIdx.x;
    int ty=threadIdx.y;


    int row_o=blockIdx.y*O_TILE_WIDTH+ty;
    int col_o=blockIdx.x*O_TILE_WIDTH+tx; 


    int row_i=row_o;
    int col_i=col_o;
    __syncthreads();


    __shared__ float Ns[I_TILE_WIDTH][I_TILE_WIDTH];

///////////////////////////////////////////////////////////
//////////////////// reading input data ///////////////////
    if( (row_i>=0)&&(row_i<in_height)&&(col_i>=0)&&(col_i<in_width) )
    {
        Ns[ty][tx]=in[row_i*in_width + col_i];
    }
    else
    {
        Ns[ty][tx]=0.0f;
    }
    __syncthreads();    


///////////////////////////////////////////////////////////
//////////////////// calculating convol ///////////////////
    float output=0.0f;
    if( (tx<O_TILE_WIDTH)&&(ty<O_TILE_WIDTH) )
    {
        for(int i=0; i<MASK_WIDTH; i++)
        {
            for(int j=0; j<MASK_WIDTH; j++)
            {
                output += Ns[ty+i][tx+j]*mask[i*MASK_WIDTH+j];

            }
        }

    }
    __syncthreads();


    if( (row_o<output_dim)&&(col_o<output_dim) )
    {
        out[row_o*output_dim+col_o]=output;//in_width
    }
    __syncthreads();
}

int main() {
    int input_dim = 24;
    int kernel_dim = 9;
    int output_dim = 16;

    float *input = new float[input_dim*input_dim];
    float *kernel = new float[kernel_dim*kernel_dim];
    float *output = new float[output_dim*output_dim];

    float *d_input;
    float *d_kernel;
    float *d_output;
    cudaMalloc(&d_input, sizeof(float)*input_dim*input_dim);
    cudaMalloc(&d_kernel, sizeof(float)*kernel_dim*kernel_dim);
    cudaMalloc(&d_output, sizeof(float)*output_dim*output_dim);

    for(int i=0; i<input_dim*input_dim; i++)
    {
        input[i]=1.0f;
    }
    for(int i=0; i<kernel_dim*kernel_dim; i++)
    {
        kernel[i]=1.0f;
    }

    cudaMemcpy(d_input, input, sizeof(float)*input_dim*input_dim, cudaMemcpyHostToDevice);
    cudaMemcpy(d_kernel, kernel, sizeof(float)*kernel_dim*kernel_dim, cudaMemcpyHostToDevice);

    dim3 dimBlock(I_TILE_WIDTH, I_TILE_WIDTH, 1);
    dim3 dimGrid((output_dim-1)/O_TILE_WIDTH+1, (output_dim-1)/O_TILE_WIDTH+1, 1);

    clock_t begin, end;
    double time_spent;
    begin = clock();

    for(int iteration=0; iteration<1; iteration++)//100000
    {
        Convolution2DBasicKernel<<<dimGrid, dimBlock>>>(d_output, d_input, input_dim, input_dim, d_kernel, output_dim);
    }

    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("time: %f\n", time_spent);

    cudaMemcpy(output, d_output, sizeof(float)*output_dim*output_dim, cudaMemcpyDeviceToHost);

    for(int y=0; y<output_dim; y++)
    {
        for(int x=0; x<output_dim; x++)
            printf("%d\t", int(output[y*output_dim+x]));
        printf("\n");
    }
}
Robert Crovella
Minkyu Choi

2 Answers

1

It is a race condition.

Here is one example. You are launching 18x18 threads per block and 2x2 blocks. Take:

ThreadA: threadIdx.x = 10, threadIdx.y = 0, blockIdx.x = 0, blockIdx.y = 0
ThreadB: threadIdx.x = 0, threadIdx.y = 0, blockIdx.x = 1, blockIdx.y = 0

Inside the kernel you calculate:

int tx=threadIdx.x;
int ty=threadIdx.y;
int row_o=blockIdx.y*O_TILE_WIDTH+ty;
int col_o=blockIdx.x*O_TILE_WIDTH+tx;

Using O_TILE_WIDTH = 10:

ThreadA: row_o = 0*10+0 = 0, col_o = 0*10+10 = 10
ThreadB: row_o = 0*10+0 = 0, col_o = 1*10+0 = 10

This means two threads will write their result to the same position in memory while computing it differently.

brano
  • Thank you for your answer. As a novice, I am not sure I understand your answer well, but I guess the race condition you mentioned is prevented by the line `if( (tx<O_TILE_WIDTH)&&(ty<O_TILE_WIDTH) )` – Minkyu Choi Dec 01 '15 at 13:24
  • I believe this is the correct answer. You are launching 4 blocks (2x2) which can run in any order. Since the 4 blocks are writing to the same output locations, but computing different results (either 81, or 0), you have a race condition. The `if` statement you've pointed out does not fix or address the issue, but only regulates whether a given thread's `output` value contains 0 or 81. At the line where the `output` value is written to the `out` array, the race will occur. You should reorganize your blocks so that each thread corresponds to a *unique* location in the output array. – Robert Crovella Dec 01 '15 at 15:43
  • Wow! your answer is correct. I did not think about that and this is the difference between the novice and expert! Thank you. I learned something great! – Minkyu Choi Dec 02 '15 at 01:25
  • @MinkyuChoi on StackOverflow, it is customary to acknowledge the correctness of an answer either by upvoting it (click on the voting arrow that points upward just above the `1` to the left of the answer provided by brano) and/or selecting the checkmark to make it green. Click ->[here](http://stackoverflow.com/help/someone-answers)<- for more information. – Robert Crovella Dec 02 '15 at 02:43
0

You are going out of bounds in your shared memory access.

Assuming you are confident your program is more or less correct, you would need to make sure you don't go out of bounds:

if( (tx<O_TILE_WIDTH)&&(ty<O_TILE_WIDTH) ) {
    for(int i=0; i<MASK_WIDTH; i++) {
        if(ty+i < O_TILE_WIDTH) { // Changed here
            for(int j=0; j<MASK_WIDTH; j++) {
                if(tx+j < O_TILE_WIDTH) { // Changed here
                    output += Ns[ty+i][tx+j]*mask[i*MASK_WIDTH+j];
                }
            }
        }
    }
}
deathly809
  • Thanks for the answer. However, I checked the indices again and there is no problem, and even after I added the if statements you mentioned, the problem still exists. The funny thing is that if I add a printf statement right below the code you wrote, it works perfectly, and if I delete it, it stops working. I am curious why it behaves this way. – Minkyu Choi Dec 01 '15 at 09:01
  • @deathly809 I have been trying to find the race condition for three days, and I have read this code so many times that I could now write it with my eyes closed. Could you give me a little more advice? With my current knowledge I cannot see the problem, although I am doing my best to find the bug. – Minkyu Choi Dec 01 '15 at 13:28