I'm implementing the following in CUDA:

  1. I get a matrix with image values (d_Data).
  2. I copy the matrix into shared memory (tile).
  3. I compute a basic pixel difference, pixel(i) - pixel(i+1) (d_diff).
  4. If the difference equals a specific value, for example 0, a 1 is set in a matrix (d_A) at each position where 0 was found in d_diff. This makes it possible to get the frequency of 0s in the original difference matrix.
  5. A parallel cumulative sum is computed over d_A.
  6. The frequency result goes into the frequencies vector (d_Av).
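
For reference, steps 3 to 6 can be sketched without an intermediate d_A matrix: each thread tests one horizontal difference and the zero count is accumulated with `atomicAdd`. This is a minimal sketch and not the kernel from this question. The names `countZeroDiffsAtomic` and `zeroDiffCount` are hypothetical, the image is assumed to be row-major with `dw` columns and `dh` rows, differences are assumed to be taken only between horizontal neighbors within a row, and error checking of the CUDA calls is left out for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: counts positions where pixel(i) - pixel(i+1) == 0.
__global__ void countZeroDiffsAtomic(const float *data, int dw, int dh,
                                     unsigned int *count)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // The last column has no right-hand neighbor, so it is skipped.
    if (row < dh && col < dw - 1) {
        float diff = data[row * dw + col] - data[row * dw + col + 1];
        if (diff == 0.0f)
            atomicAdd(count, 1u);  // atomic, so safe across all blocks
    }
}

// Hypothetical host wrapper: returns the number of zero differences.
unsigned int zeroDiffCount(const std::vector<float> &image, int dw, int dh)
{
    assert(image.size() == static_cast<size_t>(dw) * dh);

    float *d_data = nullptr;
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_data, image.size() * sizeof(float));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_data, image.data(), image.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    // The counter is reset from the host, before the launch,
    // rather than by the threads of the kernel itself.
    cudaMemset(d_count, 0, sizeof(unsigned int));

    dim3 block(16, 16);
    dim3 grid((dw + block.x - 1) / block.x, (dh + block.y - 1) / block.y);
    countZeroDiffsAtomic<<<grid, block>>>(d_data, dw, dh, d_count);

    unsigned int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_count);
    return h_count;
}
```

A single atomic counter is simple but serializes the increments; it is mainly useful as a baseline to compare a hand-written reduction against.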

Checking step by step, everything works as expected until the cumulative sum. The value computed on the CPU is 104347, but from CUDA I sometimes get NaN and other times an arbitrary number, for example 2425. The strange part is that if I keep running the kernel 20 or 30 times, the value eventually becomes the expected 104347 :S.

For each matrix I allocate with:

    h_Data = (float *)malloc(data_size);
    h_diff = (float *)malloc(data_size);
    h_A    = (float *)malloc(data_size);

and free with:

    cudaFree(d_A);
    cudaFree(d_diff);
    cudaFree(d_Av);

so I don't understand why the result gets closer and closer to the correct value when I run the code enough times. By the way, once the correct value is reached, it no longer changes no matter how many times I run the code.

The code:

    __global__ void spam(float *d_Data, float *d_diff, float *d_A, int dw, int dh, float *d_Av)
    {
        long bx = blockIdx.x;  long by = blockIdx.y;
        long tx = threadIdx.x; long ty = threadIdx.y;

        // Identify the row and column of the Pd element to work on
        long Row = by * TILE_WIDTH + ty;
        long Col = bx * TILE_WIDTH + tx;
        long tid = Row*dw+Col;
        long i = 512*512;
        long r = MASK_DIM/2;
        long s = 0;

        __shared__ int tile[BLOCK_WIDTH][BLOCK_WIDTH];

        for (int k=0; k<=8; k++)
            d_Av[k]=0;

        if (tid < dw*dh)
        {
            // to shared memory.
                                               tile[ty + r]  [tx + r]   = d_Data[Row*dw+Col];
            if (Col-r >= 0)                    tile[ty + r]  [tx]       = d_Data[Row*dw+Col-r];
            if (Col+r < dw)                    tile[ty + r]  [tx + 2*r] = d_Data[Row*dw+Col+r];
            if (Row-r >= 0)                    tile[ty]      [tx + r]   = d_Data[(Row - r)*dw + Col];
            if (Row+r < dw)                    tile[ty + 2*r][tx + r]   = d_Data[(Row + r)*dw + Col];
            if (Row - r >= 0 && Col - r >= 0)  tile[ty]      [tx]       = d_Data[(Row-r)*dw+Col-r];
            if (Row - r >= 0 && Col + r < dw)  tile[ty]      [tx + 2*r] = d_Data[(Row-r)*dw+Col+r];
            if (Row + r < dw && Col - r >= 0)  tile[ty + 2*r][tx]       = d_Data[(Row+r)*dw+Col-r];
            if (Row + r < dw && Col + r < dw)  tile[ty + 2*r][tx + 2*r] = d_Data[(Row-r)*dw+Col+r];

            // Calculates the difference matrix
            d_diff[tid] = (tile[ty + r][tx + r] - tile[ty + r][tx + r + 1]);

            d_A[tid]=0;

            // Set a 1 in each position in d_A where 0 was found in d_diff.
            if (d_diff[tid] == 0)
            { d_A[tid]=1; }
            __syncthreads();

            // cumulative sum to get the frequency of value 0 in d_diff.  // The error is HERE
            for (s = (i/2); s>=1; s=s/2) {
                if (tid < s)
                {
                    d_A[tid] += d_A[tid+s];
                }
            }

            // set the frequency value in frequencies vector.
            d_Av[0] = d_A[0];

        } // END IF tid < dw*dh
    }

Any idea is welcome :D

1 Answer


You can try replacing the if-statement with the following code: d_A[tid] += d_A[tid+s] * (tid < s);

Also make sure that this code does not lead to a race condition, which is often the case with a parallel sum.
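
To illustrate the race-condition point: `__syncthreads()` only synchronizes the threads of one block, so a halving loop that runs over a whole array in global memory has no ordering between blocks, and without a barrier inside the loop it has no ordering between steps either. One common pattern is sketched below, under assumptions: each block reduces its own slice in shared memory with a barrier after every step, and the per-block results are combined with `atomicAdd`. The names `sumFlags`, `sumFlagsOnDevice` and `REDUCE_THREADS` are hypothetical, the input is assumed to be an array of 0/1 flags like d_A, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define REDUCE_THREADS 256  // assumed block size; must be a power of two

// Hypothetical kernel: block-level tree reduction of 0/1 flags.
__global__ void sumFlags(const float *flags, int n, float *total)
{
    __shared__ float partial[REDUCE_THREADS];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute 0 but still reach every barrier.
    partial[tid] = (gid < n) ? flags[gid] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s >= 1; s /= 2) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();  // outside the if: all threads of the block arrive
    }

    // One atomic update per block combines the partial sums.
    if (tid == 0)
        atomicAdd(total, partial[0]);
}

// Hypothetical host wrapper: returns the sum of all flags.
float sumFlagsOnDevice(const std::vector<float> &flags)
{
    assert(!flags.empty());
    int n = static_cast<int>(flags.size());

    float *d_flags = nullptr;
    float *d_total = nullptr;
    cudaMalloc(&d_flags, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_flags, flags.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float));  // all-zero bytes represent 0.0f

    int blocks = (n + REDUCE_THREADS - 1) / REDUCE_THREADS;
    sumFlags<<<blocks, REDUCE_THREADS>>>(d_flags, n, d_total);

    float h_total = 0.0f;
    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_flags);
    cudaFree(d_total);
    return h_total;
}
```

Because the flags are 0 or 1 and a 512 x 512 image has fewer than 2^24 pixels, the float sum stays exact regardless of the order in which the blocks finish.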

MK

Dori
  • Hi @njuffa and @Robert Crovella, thank you for your help. I tried __syncthreads() after each bulk access to d_A, d_diff or tile, and the behavior remains the same :(. I'm not sure whether a __syncthreads() is really necessary for d_A and d_diff, since they are in global memory. – superjuanx Feb 22 '13 at 17:36
  • Thanks @Matso, I tried the change but everything remains the same. Your comment about race conditions is interesting, because that seems to be the cause, since I get different results on each run. (It doesn't explain why the results get closer and closer to the correct value until they reach it after I run the program 20 or 30 times.) I tried a Thrust reduction and it all works fine (even faster :P); however, I want to understand why my code is not working. I'll keep trying :) – superjuanx Feb 22 '13 at 17:44
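
The Thrust reduction mentioned in the last comment is not shown in the thread. A minimal sketch of what such a call could look like is given here, assuming the 0/1 flags are available as a host array; `sumFlagsThrust` is a hypothetical helper name, and the actual code used by the commenter may differ.

```cuda
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

// Hypothetical helper: sums a host array of 0/1 flags on the GPU.
float sumFlagsThrust(const std::vector<float> &flags)
{
    // Copies the flags to device memory; freed when d_flags goes out of scope.
    thrust::device_vector<float> d_flags(flags.begin(), flags.end());

    // Thrust handles the inter-block synchronization internally.
    return thrust::reduce(d_flags.begin(), d_flags.end(),
                          0.0f, thrust::plus<float>());
}
```

For data that already lives in device memory, a raw pointer can be wrapped with `thrust::device_pointer_cast` instead of copying through a `device_vector`.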