
I have the following simple kernel that should compute the sum of the element-wise products of arrays a and b (a dot product), but __syncthreads() does not seem to work at all. I debugged it, and temp[i] returns uninitialized values for some elements. If I omit the __syncthreads() the result is the same. (I checked, and all other parts of the CUDA code, such as initialization of the arrays, copying to device memory, etc., are written correctly, so the problem is in this kernel.) (Note: I don't want to use atomicAdd.)

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h> 

#include "cuda_runtime.h"
#include "device_launch_parameters.h"  

#define MINREAL -1024.0
#define MAXREAL 1024.0

#define ACCURACY 0.01

#define NUM_OF_GPU_THREADS 256 

void checkCUDAError(const char *msg) {
    cudaError_t err = cudaGetLastError(); 
    if( cudaSuccess != err){
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) ); 
        exit(EXIT_FAILURE); 
    } 
} 

void vecFillRand(int N, float *vec) {
    int i;
    for(i = 0; i < N; i++) 
        vec[i] = (rand() / (float) RAND_MAX)*(MAXREAL - MINREAL) + MINREAL;
}

float seq_dotProduct(float *a, float *b, int n) {
    int i;
    float dp;
    dp = 0;
    for(i = 0; i < n; i++) {
        dp += a[i] * b[i];
    }
    return dp;
}

// kernel
__global__ void dotProduct(float *a, float *b, float *c) {
    __shared__ float temp[NUM_OF_GPU_THREADS];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    float t;

    if (idx >= gridDim.x * blockDim.x)
        return;

    temp[threadIdx.x] = a[idx] * b[idx];

    __syncthreads();

    if(threadIdx.x == 0) {
        c[blockIdx.x] = 0.0f;
        for(int i = 0; i < NUM_OF_GPU_THREADS; i++){
            t = temp[i];
            c[blockIdx.x] = c[blockIdx.x] + t;
        }
    }
}

int main(int argc, char* argv[]) {
    int i, n, ARRAY_BYTES;
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;
    float sum;
    float seq_sum;
    clock_t t;

    srand(time(NULL));

    if (argc == 2) {
        n = atoi(argv[1]);
    } else {
        printf("N? ");
        fflush(stdout);
        scanf("%d", &n);    
    }

    int BLOCKS_PER_GRID = (unsigned int)ceil(n/(float)NUM_OF_GPU_THREADS);

    // arrays on host
    ARRAY_BYTES = n * sizeof(float);
    h_A = (float *) malloc(ARRAY_BYTES);
    h_B = (float *) malloc(ARRAY_BYTES);
    h_C = (float *) malloc(BLOCKS_PER_GRID * sizeof(float));
    printf("\ncreating A and B...\n\n");
    vecFillRand(n, h_A);
    vecFillRand(n, h_B);
    vecFillRand(BLOCKS_PER_GRID, h_C);

    // arrays on device
    cudaMalloc((void**) &d_A, ARRAY_BYTES);
    cudaMalloc((void**) &d_B, ARRAY_BYTES);
    cudaMalloc((void**) &d_C, BLOCKS_PER_GRID * sizeof(float));

    // transfer the arrays to the GPU
    cudaMemcpy(d_A, h_A, ARRAY_BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, ARRAY_BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, BLOCKS_PER_GRID, cudaMemcpyHostToDevice);

    // TIME START
    // create events for timing execution 
    cudaEvent_t start = cudaEvent_t(); 
    cudaEvent_t stop = cudaEvent_t(); 
    cudaEventCreate( &start ); 
    cudaEventCreate( &stop ); 
    // record time into start event 
    cudaEventRecord( start, 0 ); // 0 is the default stream id 

    // launch the kernel
    dim3 block(NUM_OF_GPU_THREADS); // 256, 1, 1
    dim3 grid(BLOCKS_PER_GRID);  
    printf ("computing dotProduct... \n");
    dotProduct<<<grid, block>>>(d_A, d_B, d_C);

    // block until the device has completed 
    cudaThreadSynchronize();    

    // check if kernel execution generated an error 
    // Check for any CUDA errors 
    checkCUDAError("kernel invocation"); 

    // TIME END
    // record time into stop event 
    cudaEventRecord( stop, 0 ); 
    // synchronize stop event to wait for end of kernel execution on stream 0 
    cudaEventSynchronize( stop ); 
    // compute elapsed time (done by CUDA run-time) 
    float elapsed_kernel = 0.f; 
    cudaEventElapsedTime( &elapsed_kernel, start, stop ); 
    // release events 
    cudaEventDestroy( start ); 
    cudaEventDestroy( stop ); 
    // print kernel time
    printf("CUDA TIME: %f \n\n", elapsed_kernel/1000);

    // copy back the result array to the CPU
    cudaMemcpy(h_C, d_C, BLOCKS_PER_GRID * sizeof(float), cudaMemcpyDeviceToHost );

    // Check for any CUDA errors 
    checkCUDAError("memcpy");

    // compute sum
    sum = 0;
    for (i = 0; i < BLOCKS_PER_GRID; i++)
        sum += h_C[i];

    //  launch sequential
    t = clock();
    printf ("computing seq_dotProduct... \n");
    seq_sum = seq_dotProduct(h_A, h_B, n);
    t = clock() - t;
    printf ("SEQ TIME: %f \n\n",((float)t)/CLOCKS_PER_SEC);

    // check sum and seq_sum    
    float value = abs(sum - seq_sum);
    if (value > ACCURACY) {
        printf("Test FAILED: %f \n", value);        
    }
    else{
        printf("Test PASSED \n");
    } 

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    free(h_A);
    free(h_B);
    free(h_C);  

    return 0;
}
dede
    What's the purpose of `if (idx >= gridDim.x) return;`? Why do you want to return the threads greater than the number of blocks in the grid? I think the problem is here. Plus, instead of having a `for` loop for summation, you can use parallel reduction technique. – Farzad Jan 13 '14 at 05:18
  • Yes, you are right, it should be `if (idx >= gridDim.x * blockDim.x)`; it seems that was the problem, but I am not sure yet, I am testing it now. Can you show me how I can use parallel reduction? – dede Jan 13 '14 at 05:34
  • As @Farzad said, your code almost certainly has a variety of errors. But it's impossible to be explicit in telling you how to fix them unless you provide a complete code, so we can understand your usage of threads and blocks in the mapping of your vectors. In fact, [SO expects this](http://stackoverflow.com/help/on-topic). For example, this code only seems to make sense if you call it with a single threadblock. Is that your intent? – Robert Crovella Jan 13 '14 at 05:35
  • All right, so here is the whole code. Now, after fixing that, when I compare the CUDA and sequential sums I get different values. I have no idea why that is now. – dede Jan 13 '14 at 05:50
  • Try running your code with `cuda-memcheck`. You'll see lots of errors. – Robert Crovella Jan 13 '14 at 06:01
  • For N=64 and for N<256 I always got "Test PASSED". But for anything larger than 256 I usually got "Test FAILED", and sometimes but rarely "Test PASSED". Totally weird – dede Jan 13 '14 at 06:09
  • @Robert Where do I put cuda-memcheck? I use VS2010. Is it a compiler parameter? – dede Jan 13 '14 at 06:13
  • build your code, i.e. create an executable that can be run from the command line. Then open a console, change directory to where your executable is, then run `cuda-memcheck my_executable` from the command line. – Robert Crovella Jan 13 '14 at 06:27

1 Answer


Your kernel needs to be re-written slightly.

The "last block" will kill any threads that exceed the length of the input vectors, but the for loop that sums the elements for the last block does not properly check to make sure it does not exceed the length of the vector. This is leading to out-of-bounds read accesses, which will show up when you run your code with cuda-memcheck (and assuming you input a vector size that is not a multiple of 256).

Furthermore, __syncthreads() should not be used in conditional code unless the condition evaluates the same way across all threads in the block. Your last block violates this rule for vector lengths that are not a multiple of 256.
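
To make that rule concrete, here is a minimal sketch of the safe pattern (the kernel name and the blockSums parameter are just illustrative): guard the memory access rather than the barrier, so every thread in the block, including the out-of-range threads in the last block, reaches the same __syncthreads() call.

// Sketch of the safe pattern: no early return before the barrier.
__global__ void partialDotProducts(const float *a, const float *b, float *blockSums, int n) {
    __shared__ float temp[256];  // assumes a block size of 256 threads
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Out-of-range threads store a neutral value instead of returning early...
    temp[threadIdx.x] = (idx < n) ? a[idx] * b[idx] : 0.0f;

    // ...so every thread in the block reaches this barrier.
    __syncthreads();

    // Thread 0 sums the block's products; the padding zeros keep the result correct.
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; i++)
            s += temp[i];
        blockSums[blockIdx.x] = s;
    }
}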

Beyond that, for larger vector sizes, you're expecting too much (too many digits of) accuracy out of a float quantity. You'll need to scale your ACCURACY test by the number of float values you are summing together.
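As a sketch of one way to do that (the helper name and the exact tolerance are just illustrative), the comparison can be made relative and scaled by the number of terms being summed:

#include <float.h>
#include <math.h>

/* Returns nonzero if the two sums agree to within a relative tolerance that
   grows with n, since the worst-case rounding error of a naive float sum of
   n terms is on the order of n * FLT_EPSILON. */
int sums_match(float gpu_sum, float cpu_sum, int n) {
    float rel_err = fabsf(gpu_sum - cpu_sum) / fmaxf(fabsf(cpu_sum), FLT_MIN);
    return rel_err <= (float)n * FLT_EPSILON;
}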

Here's a modified version of your code that includes the changes I mention above, which seems to work correctly for me:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

//#include "cuda_runtime.h"
//#include "device_launch_parameters.h"

#define MINREAL -1024.0
#define MAXREAL 1024.0
#define FAST_RED
#define ACCURACY 0.0001

#define NUM_OF_GPU_THREADS 256

void checkCUDAError(const char *msg) {
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err){
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}

void vecFillRand(int N, float *vec) {
    int i;
    for(i = 0; i < N; i++)
        vec[i] = (rand() / (float) RAND_MAX)*(MAXREAL - MINREAL) + MINREAL;
}

float seq_dotProduct(float *a, float *b, int n) {
    int i;
    float dp;
    dp = 0;
    for(i = 0; i < n; i++) {
        dp += a[i] * b[i];
    }
    return dp;
}

// kernel
__global__ void dotProduct(float *a, float *b, float *c, int n) {
    __shared__ float temp[NUM_OF_GPU_THREADS];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
      temp[threadIdx.x] = a[idx] * b[idx];
    else temp[threadIdx.x] = 0.0f;

    __syncthreads();
#ifdef FAST_RED
    // assumes block dimension is a power of 2
    for (int i = blockDim.x>>1; i > 0; i >>= 1){
      if (threadIdx.x < i) temp[threadIdx.x] += temp[threadIdx.x+i];
      __syncthreads();}
    if (threadIdx.x == 0) c[blockIdx.x] = temp[0];
#else
    float t;
    if(threadIdx.x == 0) {
        c[blockIdx.x] = 0.0f;
        int j=0;
        for(int i = blockIdx.x*blockDim.x; ((i < ((blockIdx.x+1)*blockDim.x)) && (i < n)); i++){
            t = temp[j++];
            c[blockIdx.x] = c[blockIdx.x] + t;
        }
    }
#endif
}

int main(int argc, char* argv[]) {
    int i, n, ARRAY_BYTES;
    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;
    float sum;
    float seq_sum;
    clock_t t;

    srand(time(NULL));

    if (argc == 2) {
        n = atoi(argv[1]);
    } else {
        printf("N? ");
        fflush(stdout);
        scanf("%d", &n);
    }

    int BLOCKS_PER_GRID = (unsigned int)ceil(n/(float)NUM_OF_GPU_THREADS);
    printf("bpg = %d\n", BLOCKS_PER_GRID);

    // arrays on host
    ARRAY_BYTES = n * sizeof(float);
    h_A = (float *) malloc(ARRAY_BYTES);
    h_B = (float *) malloc(ARRAY_BYTES);
    h_C = (float *) malloc(BLOCKS_PER_GRID * sizeof(float));
    printf("\ncreating A and B...\n\n");
    vecFillRand(n, h_A);
    vecFillRand(n, h_B);
    vecFillRand(BLOCKS_PER_GRID, h_C);

    // arrays on device
    cudaMalloc((void**) &d_A, ARRAY_BYTES);
    cudaMalloc((void**) &d_B, ARRAY_BYTES);
    cudaMalloc((void**) &d_C, BLOCKS_PER_GRID * sizeof(float));

    // transfer the arrays to the GPU
    cudaMemcpy(d_A, h_A, ARRAY_BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, ARRAY_BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, BLOCKS_PER_GRID * sizeof(float), cudaMemcpyHostToDevice);

    // TIME START
    // create events for timing execution
    cudaEvent_t start = cudaEvent_t();
    cudaEvent_t stop = cudaEvent_t();
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    // record time into start event
    cudaEventRecord( start, 0 ); // 0 is the default stream id

    // launch the kernel
    dim3 block(NUM_OF_GPU_THREADS); // 256, 1, 1
    dim3 grid(BLOCKS_PER_GRID);
    printf ("computing dotProduct... \n");
    dotProduct<<<grid, block>>>(d_A, d_B, d_C, n);

    // block until the device has completed
    cudaDeviceSynchronize();

    // check if kernel execution generated an error
    // Check for any CUDA errors
    checkCUDAError("kernel invocation");

    // TIME END
    // record time into stop event
    cudaEventRecord( stop, 0 );
    // synchronize stop event to wait for end of kernel execution on stream 0
    cudaEventSynchronize( stop );
    // compute elapsed time (done by CUDA run-time)
    float elapsed_kernel = 0.f;
    cudaEventElapsedTime( &elapsed_kernel, start, stop );
    // release events
    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    // print kernel time
    printf("CUDA TIME: %f \n\n", elapsed_kernel/1000);

    // copy back the result array to the CPU
    cudaMemcpy(h_C, d_C, BLOCKS_PER_GRID * sizeof(float), cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // compute sum
    sum = 0;
    for (i = 0; i < BLOCKS_PER_GRID; i++)
        sum += h_C[i];

    //  launch sequential
    t = clock();
    printf ("computing seq_dotProduct... \n");
    seq_sum = seq_dotProduct(h_A, h_B, n);
    t = clock() - t;
    printf ("SEQ TIME: %f \n\n",((float)t)/CLOCKS_PER_SEC);

    // check sum and seq_sum
    float value = abs((sum - seq_sum)/sum);
    if (value > ACCURACY) {
        printf("Test FAILED: err: %f cpu: %f  gpu: %f \n", value, seq_sum, sum);
    }
    else{
        printf("Test PASSED \n");
    }

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}

EDIT: In response to a question below, I've modified the code to demonstrate a rudimentary parallel reduction.
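
For intuition, here is a host-side sketch of the tree reduction that the FAST_RED path performs (assuming, as the kernel does, that the length is a power of two): each pass halves the number of active threads, and after log2(blockDim.x) passes the block's sum ends up in element 0.

/* Host-side analogue of the in-kernel tree reduction (FAST_RED path).
   Assumes len is a power of two, matching the kernel's assumption about
   blockDim.x. The inner loop models the threads that run concurrently
   on the GPU during each pass. */
float tree_reduce(float *buf, int len) {
    for (int stride = len >> 1; stride > 0; stride >>= 1)
        for (int tid = 0; tid < stride; tid++)
            buf[tid] += buf[tid + stride];
    return buf[0];
}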

Robert Crovella
  • Great, it works now. I am now interested in how to improve the kernel with a reduction; can you suggest a solution? – dede Jan 14 '14 at 20:15
  • For general understanding of parallel reduction, study the [CUDA parallel reduction sample code and presentation.](http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction) – Robert Crovella Jan 14 '14 at 20:39
  • I've modified the code I posted in the answer to demonstrate a rudimentary parallel reduction approach. – Robert Crovella Jan 14 '14 at 20:57
  • I now see that you put `float value = abs((sum - seq_sum)/sum);` instead of just `float value = abs(sum - seq_sum);`. In that case this code doesn't work; I got Test FAILED for every value. I didn't check with the reduction version, but in the original it doesn't work. – dede Jan 14 '14 at 22:05
  • As I indicated, you cannot expect numerical equality from two `float` results generated by two different machines (CPU and GPU). There are a variety of reasons for this. Nevertheless, as a general principle, when comparing floating point numbers, we must keep track of how many digits we are comparing. The final digits will vary, but this does not invalidate the results. That is why I made the change in the comparison, as I indicated in my original response. If you are expecting numerical identity to the last binary digit between the two results, you will not get it. – Robert Crovella Jan 14 '14 at 22:38
  • Take a look at [this question](http://stackoverflow.com/questions/21130121/pycuda-precision-of-matrix-multiplication-code) which has a similar issue. You need to test the *relative error* when looking at floating point comparisons, not the *absolute error*. – Robert Crovella Jan 15 '14 at 14:53