
I am trying to accelerate the following bit of CUDA code by using multiple streams.

#define N (4096 * 4096)
#define blockDimX  16
#define blockDimY  16

const int dimX = 4096;   // N == dimX * dimY
const int dimY = 4096;

__device__ float domain1[N];
__device__ float domain2[N];

float* domainOut_gpu;    // device output buffer, allocated elsewhere with cudaMalloc

__global__ void updateDomain1_kernel(const int dimX, const int dimY) {
    // update mechanism here for domain1
    // ...
}

__global__ void updateDomain2_kernel(const int dimX, const int dimY) {
    // update mechanism here for domain2, which is nearly the same
    // ...
}

__global__ void addDomainsTogether_kernel(float* domainOut,
                                     const int dimX, 
                                     const int dimY) 
{
    // add domain1 and domain2 together and fill domainOut
}

void updateDomains(float* domainOut) {
    dim3 blocks((dimX + blockDimX - 1) / blockDimX, (dimY + blockDimY - 1) / blockDimY);
    dim3 threads(blockDimX, blockDimY);

    updateDomain1_kernel<<<blocks, threads>>> (dimX, dimY);
    updateDomain2_kernel<<<blocks, threads>>> (dimX, dimY);
    addDomainsTogether_kernel<<<blocks, threads>>> (domainOut_gpu, dimX, dimY);
    cudaMemcpy(domainOut, domainOut_gpu, N * sizeof(float), cudaMemcpyDeviceToHost);
}

The precise implementation doesn't really matter; what's important is that the two domain updates are completely independent operations, after which both results are used in the third kernel call. Hence I thought it a good idea to try to accelerate the code by putting each update kernel in its own stream, so that they run simultaneously. So I changed it to the following:

void updateDomains(float* domainOut) {
    dim3 blocks((dimX + blockDimX - 1) / blockDimX, (dimY + blockDimY - 1) / blockDimY);
    dim3 threads(blockDimX, blockDimY);

    cudaStream_t stream0, stream1;
    cudaStreamCreate(&stream0);
    cudaStreamCreate(&stream1);

    updateDomain1_kernel<<<blocks, threads, 0, stream0>>> (dimX, dimY);
    updateDomain2_kernel<<<blocks, threads, 0, stream1>>> (dimX, dimY);
    cudaDeviceSynchronize();

    addDomainsTogether_kernel<<<blocks, threads>>> (domainOut_gpu, dimX, dimY);
    cudaMemcpy(domainOut, domainOut_gpu, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream0);
    cudaStreamDestroy(stream1);
}
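(As an aside, I understand the cross-stream dependency could also be expressed without blocking the host, by making the add kernel's stream wait on an event recorded in the other stream. A minimal sketch of what I mean, with an illustrative event name:)

cudaEvent_t domain2Done;
cudaEventCreateWithFlags(&domain2Done, cudaEventDisableTiming);

updateDomain1_kernel<<<blocks, threads, 0, stream0>>> (dimX, dimY);
updateDomain2_kernel<<<blocks, threads, 0, stream1>>> (dimX, dimY);
cudaEventRecord(domain2Done, stream1);

// Stream order already serializes the add kernel after updateDomain1 in
// stream0; this makes it also wait for updateDomain2 in stream1.
cudaStreamWaitEvent(stream0, domain2Done, 0);
addDomainsTogether_kernel<<<blocks, threads, 0, stream0>>> (domainOut_gpu, dimX, dimY);

// cudaMemcpy on the default stream synchronizes with the preceding work.
cudaMemcpy(domainOut, domainOut_gpu, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaEventDestroy(domain2Done);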

I expected to see a difference in performance, but there is absolutely no noticeable difference. So, thinking that maybe the compiler was being smart the first time and automatically scheduling the update calls concurrently, I assumed that the following should slow things down:

void updateDomains(float* domainOut) {
    dim3 blocks((dimX + blockDimX - 1) / blockDimX, (dimY + blockDimY - 1) / blockDimY);
    dim3 threads(blockDimX, blockDimY);

    cudaStream_t stream0;
    cudaStreamCreate(&stream0);

    updateDomain1_kernel<<<blocks, threads, 0, stream0>>> (dimX, dimY);
    updateDomain2_kernel<<<blocks, threads, 0, stream0>>> (dimX, dimY);

    addDomainsTogether_kernel<<<blocks, threads, 0, stream0>>> (domainOut_gpu, dimX, dimY);
    cudaMemcpy(domainOut, domainOut_gpu, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaStreamDestroy(stream0);
}

However, again there is hardly any difference in performance. If anything, the last version seems fastest. Which makes me think there is something about CUDA streams I don't understand. Can someone enlighten me on how to accelerate this code?

  • What is your compute capability? There's plenty of rules to respect in order for streams to work properly, I suggest you read [this](http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf). – user703016 Mar 05 '13 at 09:09
  • I had read this document, but it mainly talks about data partitioning, which confused me a little. I'm using compute capability 2.0, by the way. – Yellow Mar 05 '13 at 14:04

1 Answer


Increased parallelism only increases your computational throughput if you weren't already using all the cores available. If you already had sufficient parallelism, it won't help you except to increase your synchronization overheads.
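To get a rough sense of whether a single launch already saturates the device, you can compare the grid's thread count against the GPU's resident-thread capacity. A minimal sketch (illustrative only; it ignores per-kernel register and shared-memory limits):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Hard upper bound on threads resident on the device at once.
    int residentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    // One update kernel over a 4096 x 4096 domain asks for this many threads.
    long long requestedThreads = 4096LL * 4096LL;

    printf("resident capacity: %d threads, requested by one kernel: %lld threads\n",
           residentThreads, requestedThreads);
    // If requested far exceeds capacity, the first kernel alone fills the
    // machine and a kernel in a second stream has nowhere to run concurrently.
    return 0;
}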

  • But how do I know if I have sufficient parallelism? I.e. how do I know that the first example is sufficient? And does this mean that the second implementation might yield improved performance on a different computer (with a better GPU), or is something just wrong in the code? – Yellow Mar 04 '13 at 19:32
  • 2
    This is a good answer. To add some specifics, the grid for these kernels is (4096/16)*(4096/16) = 65536 threadblocks total. This many threadblocks in a single kernel launch will completely fill the machine, preventing any threadblocks from subsequent kernel launches from executing, until nearly all of the threadblocks from the first kernel launch are drained. This question is a duplicate of several others, such as [this one](http://stackoverflow.com/questions/14124673/cuda-kernels-not-executing-concurrently/14271418#14271418). – Robert Crovella Mar 04 '13 at 19:37
  • Hmmm, I have some doubts still. If what you say is right, I would expect that for a smaller size of `N`, the GPU isn't full yet and the streams can be executed in parallel. But however small I make the domain, this doesn't happen: I analysed the performance in NSight and both streams are always run consecutively no matter the size of `N`. Is there another way to ensure they run in parallel? – Yellow Mar 05 '13 at 15:46
  • 2
    For an obvious concurrent kernel scenario, each kernel could only have "a few blocks" and this will depend on which GPU you are running on as well. Did you try small N (eg. 32*32)? Such a kernel may run so quick it would be hard to observe overlap anyway. If you really want to observe concurrent kernel execution, windows can interfere with timeline sequencing, and under linux, nsight EE may serialize in visual profiler. Also read [the requirements](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#conurrent-kernel-execution). This probably can't be covered in comments. – Robert Crovella Mar 05 '13 at 18:46
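
Following up on the comments above: a minimal way to check whether a GPU overlaps kernels at all is to launch two tiny, deliberately long-running kernels in separate streams and inspect the profiler timeline. A sketch along the lines of the CUDA concurrentKernels sample (names are illustrative):

__global__ void spin_kernel(clock_t cycles) {
    // Busy-wait on the SM clock so each launch runs long enough that
    // overlap (or the lack of it) is visible on a profiler timeline.
    clock_t start = clock();
    clock_t elapsed = 0;
    while (elapsed < cycles) {
        elapsed = clock() - start;
    }
}

int main() {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // One 32-thread block per launch: neither kernel can fill the device,
    // so a concurrency-capable GPU is free to run them at the same time.
    spin_kernel<<<1, 32, 0, s0>>>((clock_t)100000000);
    spin_kernel<<<1, 32, 0, s1>>>((clock_t)100000000);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}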