Overlapping transfers and kernel executions in CUDA with two loops

Question

I want to overlap data transfers and kernel executions in a form like this:

int numStreams = 3;
int size = 10;

for(int i = 0; i < size; i++) {
    cuMemcpyHtoDAsync( _bufferIn1,
                           _host_memoryIn1,
                           _size * sizeof(T),
                           cuda_stream[i % numStreams]);

    cuMemcpyHtoDAsync( _bufferIn2,
                           _host_memoryIn2,
                           _size * sizeof(T),
                           cuda_stream[i % numStreams]);

    cuLaunchKernel( _kernel,
                    gs.x(), gs.y(), gs.z(),
                    bs.x(), bs.y(), bs.z(),
                    _memory_size,
                    cuda_stream[i % numStreams],
                    _kernel_arguments,
                    0
                  );
    cuEventRecord(events[i], cuda_stream);
}

for(int i = 0; i < size; i++) {
    cuEventSynchronize(events[i]);

    cuMemcpyDtoHAsync( _host_memoryOut,
                           _bufferOut,
                           _size * sizeof(T),
                           cuda_stream[i % numStreams]);
}

Is overlapping possible in this case? Currently only the HtoD-transfers overlap with the kernel executions. The first DtoH-transfer is executed after the last kernel execution.

You're trying to record an event on an _array_ of streams... Also, what overlap would you _like_ to have? It's impossible to figure that out from just the code you've posted. — einpoklum, Apr 16 '20 at 22:41

score 2 · Answer 1 · answered Apr 16 '20 at 22:34

Overlap is only possible between operations issued to different streams. Operations within one stream execute sequentially, in the order the host issued them, so a device-to-host copy issued at the end runs only after everything previously queued on that stream has completed. With size = 10 and numStreams = 3, the last kernel (i = 9, and 9 % 3 = 0) and the first DtoH copy (i = 0) both land on stream 0, so that copy has to wait for the last kernel to finish. And because the second loop synchronizes on an event before issuing each copy, the copies for the other streams (streams 1 and 2) have not been issued yet at that point.

heapoverflow

Thanks! Is overlapping possible if the number of streams were equal to the number of loop iterations? So that I only enqueue HtoD-transfer, kernel execution and DtoH-transfer in each stream. – Eagle06 Apr 17 '20 at 08:05

Depends on what kind of overlapping you want to achieve and how you use the streams and synchronize. You can read this post to get a better idea of how it works: https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/ – heapoverflow Apr 17 '20 at 09:15