
I have a rather strange observation on the following code snippet.

When I do both - copy memory to the device and copy the results back to the host - the streams seem to be synchronized, i.e. they execute the kernels sequentially. Once I remove the copy to the host and keep copying the parameters to the device, the streams execute in parallel; once I remove copying the parameters and keep copying the results, the streams also execute in parallel.

Any idea why, and how to solve the problem?

// Depth-first issue order: the HtoD copy, kernel launch and DtoH copy are
// queued on each stream before moving on to the next stream
for (int j=0; j<n_streams; j++) {
    cuMemcpyHtoDAsync(gpu_parameters[j], parameters[j].asPointer(), (parameterCount) * Sizeof.FLOAT, stream[j]);
    Pointer kernelParameters1 = Pointer.to(
            Pointer.to(new int[]{0}),
            Pointer.to(new int[] {10000}),
            Pointer.to(gpu_data),
            Pointer.to(gpu_results[j]),
            Pointer.to(gpu_parameters[j])
            );
    cuLaunchKernel(function[j],
            s_grid, 1, 1,      // Grid dimension
            s_block, 1, 1,      // Block dimension
            0, stream[j],               // Shared memory size and stream
            kernelParameters1, null // Kernel- and extra parameters
            );
    cuMemcpyDtoHAsync(results[j].asPointer(), gpu_results[j], (results[j].size()) * Sizeof.FLOAT, stream[j]);
}
  • Unless you have a GPU with two copy engines, this is a hardware limitation which can't magically be programmed around – talonmies Jun 03 '20 at 10:47
  • Oh... even if the runtime of the kernel is 1000 times higher than the time needed for copying? – eLe Jun 03 '20 at 10:57
  • And yes, device query tells me: "Concurrent copy and kernel execution: Yes with 1 copy engine(s)" – eLe Jun 03 '20 at 11:06
  • Eventually two opposing-direction transfers will block one another and then things will serialize until both are popped off the stream. If you have two DMA engines, then the two directions can overlap and the stalling will be eliminated – talonmies Jun 03 '20 at 11:12
  • It's still not obvious to me why this happens. The same number of kernel calls using 1 stream takes approximately twice as much time as doing it with 2 streams, as long as copying is only done in one direction. It seems that switching the memory transfer direction triggers a device synchronization... – eLe Jun 03 '20 at 11:37
  • The driver pops work off stream queues according to available resources. How it does this isn't documented. But if you have a situation where the next operations on both stream queues are memcpys in opposite directions with only one copy engine, the dependency heuristics in the scheduler will serialize the queues. If you go breadth-first in scheduling, like in your answer, then you are much less likely to trigger serialization – talonmies Jun 05 '20 at 08:22
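
For reference, the copy-engine count mentioned in the comments above can also be queried programmatically. The following is a minimal JCuda sketch (assuming the standard jcuda.driver bindings used in the snippets in this post, not code from the original question); it only reports the device attribute:

import jcuda.driver.CUdevice;
import jcuda.driver.CUdevice_attribute;
import static jcuda.driver.JCudaDriver.*;

public class CopyEngineQuery {
    public static void main(String[] args) {
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);

        // Number of asynchronous copy (DMA) engines on the device
        int[] asyncEngines = new int[1];
        cuDeviceGetAttribute(asyncEngines,
                CUdevice_attribute.CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, device);

        // With 1 engine, HtoD and DtoH transfers cannot overlap each other;
        // with 2 engines, transfers in opposite directions can run concurrently.
        System.out.println("Async copy engines: " + asyncEngines[0]);
    }
}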

1 Answer


No idea why, but changing the sequence (issuing the copies and kernel launches breadth-first across the streams) removed the problem, and everything now executes in parallel.

// Stage 1: queue all host-to-device parameter copies
for (int j=0; j<n_streams; j++) {
    cuMemcpyHtoDAsync(gpu_parameters[j], parameters[j].asPointer(), (parameterCount) * Sizeof.FLOAT, stream[j]);
}
// Stage 2: launch all kernels
for (int j=0; j<n_streams; j++) {
    Pointer kernelParameters1 = Pointer.to(
            Pointer.to(new int[]{0}),
            Pointer.to(new int[] {getNPrices()}),
            Pointer.to(get_gpu_prices()),
            Pointer.to(gpu_results[j]),
            Pointer.to(gpu_parameters[j])
            //,Pointer.to(new int[]{0})
            );
    cuLaunchKernel(function[j],
            s_grid, 1, 1,      // Grid dimension
            s_block, 1, 1,      // Block dimension
            0, stream[j],               // Shared memory size and stream
            kernelParameters1, null // Kernel- and extra parameters
            );
}
// Stage 3: queue all device-to-host result copies
for (int j=0; j<n_streams; j++) {
    cuMemcpyDtoHAsync(results[j].asPointer(), gpu_results[j], (results[j].size()) * Sizeof.FLOAT, stream[j]);
}
  • depth first versus breadth first behaviour in the queuing system. – talonmies Jun 03 '20 at 12:14
  • Still I'd like to understand why the kernels are executed serially in the first attempt. My real use case is unfortunately slightly different: there I can hardly control the sequence in which the commands are added to the streams, because I have a host thread for each stream. Thus copyToDevice, executeKernel and copyToHost would always be scheduled quickly one after the other – eLe Jun 03 '20 at 13:23
  • Number of streams: I tried 2...10, always the same phenomenon. Regarding the use case where each stream is filled by a separate thread: the streams (mainly) run in parallel once I add a Thread.sleep() of about 60% of the expected kernel execution time before scheduling the cuMemcpyDtoHAsync. – eLe Jun 07 '20 at 19:09
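
To illustrate the multi-threaded use case raised in the comments above: one way to keep a breadth-first issue order even when each stream is driven by its own host thread is to rendezvous the threads between the three stages. This is only a hedged sketch of that idea, not code from the original post; enqueueHtoD, launchKernel and enqueueDtoH are hypothetical placeholders for the JCuda calls shown earlier, and each worker thread would additionally need to make the CUDA context current (cuCtxSetCurrent) before issuing any work.

import java.util.concurrent.CyclicBarrier;

// Sketch: force a breadth-first issue order across per-stream host threads
final CyclicBarrier barrier = new CyclicBarrier(n_streams);
for (int j = 0; j < n_streams; j++) {
    final int s = j;
    new Thread(() -> {
        try {
            enqueueHtoD(s);   // hypothetical wrapper for cuMemcpyHtoDAsync(..., stream[s])
            barrier.await();  // all HtoD copies queued before any kernel launch
            launchKernel(s);  // hypothetical wrapper for cuLaunchKernel(..., stream[s], ...)
            barrier.await();  // all kernels launched before any DtoH copy
            enqueueDtoH(s);   // hypothetical wrapper for cuMemcpyDtoHAsync(..., stream[s])
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }).start();
}

The two barriers simply prevent any thread from enqueuing a DtoH copy while another thread still has an HtoD copy pending at the head of its stream, which is the situation that serialized the queues in the original ordering.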