I am currently learning CUDA streams through the computation of a dot product between two vectors. The ingredients are a kernel function that takes vectors x and y and fills a vector result whose size equals the number of blocks, where each block contributes its own partial sum.
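
The kernel itself is not shown below, but a minimal sketch of such a dot_gpu_kernel (assuming a standard grid-stride loop plus shared-memory block reduction; my actual code may differ in details) would look like this:

__global__ void dot_gpu_kernel(const double *x, const double *y, double *result, int n)
{
    extern __shared__ double cache[];   // blockSize doubles, passed at launch time
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = gridDim.x * blockDim.x;

    // grid-stride loop: each thread accumulates its own partial sum
    double sum = 0.0;
    for (int i = tid; i < n; i += stride)
        sum += x[i] * y[i];
    cache[threadIdx.x] = sum;
    __syncthreads();

    // tree reduction within the block (assumes blockDim.x is a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    // one partial result per block
    if (threadIdx.x == 0)
        result[blockIdx.x] = cache[0];
}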

I also have a host function dot_gpu that calls the kernel and reduces the vector result to the final dot product value.

The synchronous version does just this:

// copy to device
copy_to_device<double>(x_h, x_d, n);
copy_to_device<double>(y_h, y_d, n);

// kernel           
double result = dot_gpu(x_d, y_d, n, blockNum, blockSize); 

while the async one goes like:

double result[numChunks];
for (int i = 0; i < numChunks; i++) {
    int offset = i * chunkSize;

    // copy to device
    copy_to_device_async<double>(x_h+offset, x_d+offset, chunkSize, stream[i]);
    copy_to_device_async<double>(y_h+offset, y_d+offset, chunkSize, stream[i]);

    // kernel
    result[i] = dot_gpu(x_d+offset, y_d+offset, chunkSize, blockNum, blockSize, stream[i]);
}
for (int i = 0; i < numChunks; i++) {
    finalResult += result[i];
    cudaStreamDestroy(stream[i]);
}
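
For completeness, the setup assumed by the loop above is roughly the following (a sketch with placeholder sizes, not my exact code): one stream per chunk, and pinned host buffers, since cudaMemcpyAsync only overlaps with other work when the host memory is page-locked.

// sketch of the setup (inside some init function); sizes are placeholders
constexpr int numChunks = 4;
const int n = 1 << 20;
const int chunkSize = n / numChunks;

cudaStream_t stream[numChunks];
for (int i = 0; i < numChunks; i++)
    cudaStreamCreate(&stream[i]);

// pinned (page-locked) host memory, required for truly asynchronous copies
double *x_h, *y_h;
cudaMallocHost((void **)&x_h, n * sizeof(double));
cudaMallocHost((void **)&y_h, n * sizeof(double));

// device memory
double *x_d, *y_d;
cudaMalloc((void **)&x_d, n * sizeof(double));
cudaMalloc((void **)&y_d, n * sizeof(double));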

I am getting worse performance when using streams and have been trying to investigate why. I tried to pipeline the host-to-device copies, kernel calls and device-to-host copies, but saw no improvement.

// accumulate the result of each block into a single value
double dot_gpu(const double *x, const double *y, int n, int blockNum, int blockSize, cudaStream_t stream=NULL)
{
    // one partial sum per block, computed on the device
    double* result = malloc_device<double>(blockNum);
    dot_gpu_kernel<<<blockNum, blockSize, blockSize * sizeof(double), stream>>>(x, y, result, n);

#if ASYNC
    // copy the partial sums back asynchronously, then wait on an event
    double* r = malloc_host_pinned<double>(blockNum);
    copy_to_host_async<double>(result, r, blockNum, stream);

    CudaEvent copyResult;
    copyResult.record(stream);
    copyResult.wait();
#else
    double* r = malloc_host<double>(blockNum);
    copy_to_host<double>(result, r, blockNum);
#endif

    // reduce the per-block partial sums on the host
    double dotProduct = 0.0;
    for (int i = 0; i < blockNum; i++) {
        dotProduct += r[i];
    }

    cudaFree(result);
#if ASYNC
    cudaFreeHost(r);
#else
    free(r);
#endif

    return dotProduct;
}

My guess is that the problem lies inside the dot_gpu() function, which does more than just call the kernel. Tell me if I understand the following stream execution correctly:

foreach stream {
    cudaMemcpyAsync( device[stream], host[stream], ... stream );
    LaunchKernel<<<...stream>>>( ... );
    cudaMemcpyAsync( host[stream], device[stream], ... stream );
}

The host executes all three instructions without being blocked, since cudaMemcpyAsync and the kernel launch return immediately (on the GPU, however, they will execute sequentially, as they are assigned to the same stream). The host then moves on to the next stream (even though stream 1 may be at who knows what stage, but who cares: it's doing its job on the GPU, right?) and issues its three instructions again without blocking, and so on. However, my code blocks the host before it can process the next stream, somewhere inside the dot_gpu() function. Is it because I am allocating and freeing memory, as well as reducing the array returned by the kernel to a single value?
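
One quick way to check this (just a sketch using std::chrono, wrapped around the same chunk loop as above) is to time the enqueue loop by itself; if the copies and kernel launches were truly asynchronous, it should return almost immediately rather than taking roughly the full transfer plus kernel time.

#include <chrono>
#include <cstdio>

// Time only the enqueue loop. If nothing in it blocked, this should take
// microseconds; if dot_gpu() blocks, it takes the full copy + kernel time.
auto t0 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < numChunks; i++) {
    int offset = i * chunkSize;
    copy_to_device_async<double>(x_h+offset, x_d+offset, chunkSize, stream[i]);
    copy_to_device_async<double>(y_h+offset, y_d+offset, chunkSize, stream[i]);
    result[i] = dot_gpu(x_d+offset, y_d+offset, chunkSize, blockNum, blockSize, stream[i]);
}
auto t1 = std::chrono::high_resolution_clock::now();
printf("enqueue loop: %lld us\n", (long long)
       std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());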

1 Answer

Assuming your object-oriented CUDA wrapper does what the function and method names suggest, there are three reasons why work from subsequent calls to dot_gpu() might not overlap:

  1. Your code explicitly blocks by recording an event and waiting for it.

  2. If it weren't already blocking because of 1., your code would block on the pinned host-side allocation and deallocation, as you suspected (see the sketch after this list).

  3. If your code weren't blocking for 2. already, work from subsequent calls to dot_gpu() might still not overlap, depending on compute capability. Devices of compute capability 3.0 or lower do not reorder operations, even if they are enqueued to different streams.

    Even for devices of compute capability 3.5 and higher, the number of streams whose operations can be reordered is limited by the CUDA_DEVICE_MAX_CONNECTIONS environment variable, which defaults to 8 and can be set to values as large as 32.
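
To illustrate points 1. and 2., a possible restructuring (a sketch only, reusing the kernel from the question; the names dot_gpu_async, result_d and result_h are placeholders) would move all allocation out of the per-chunk loop and leave only enqueue operations inside the function, deferring both synchronization and the host-side summation until after all chunks have been issued:

// Sketch: enqueue-only variant. result_d (device) and result_h (pinned host)
// hold blockNum doubles each and are allocated once by the caller.
void dot_gpu_async(const double *x, const double *y, int n,
                   int blockNum, int blockSize,
                   double *result_d, double *result_h, cudaStream_t stream)
{
    dot_gpu_kernel<<<blockNum, blockSize, blockSize * sizeof(double), stream>>>(x, y, result_d, n);
    cudaMemcpyAsync(result_h, result_d, blockNum * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);
    // no event wait and no host-side reduction here
}

The chunk loop then only enqueues work; a single cudaDeviceSynchronize() (or one cudaStreamSynchronize() per stream) before summing the partial results on the host becomes the only blocking point.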
