Does cuda.to_device use the same stream as kernel launches?
It seems that memcpy is synchronous (with respect to the host).
from numba import cuda
import numpy as np
A = np.ones((10000, 10000))
%timeit cuda.to_device(A)
188 ms ± 5.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit cuda.synchronize()
14.5 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n1 cuda.to_device(A)
cuda.synchronize()
82.6 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
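One way to separate "the call returned on the host" from "the GPU finished the copy" is to time the transfer with CUDA events, which Numba exposes as cuda.event. A minimal sketch, assuming the same A and a CUDA-capable device (the variable names are mine):

from numba import cuda
import numpy as np

A = np.ones((10000, 10000))

start, end = cuda.event(timing=True), cuda.event(timing=True)
start.record()           # marker enqueued on the default stream, before the copy
d_A = cuda.to_device(A)  # the transfer under test
end.record()             # marker enqueued after the copy
end.synchronize()        # block the host until the end marker has executed on the GPU
print(cuda.event_elapsed_time(start, end), 'ms spent on the device')

The elapsed time reported here is measured on the device, independently of when to_device returns on the host.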
If cuda.to_device is synchronous, why would the synchronization take longer than 14.5 µs? (Or does "synchronous with respect to the host" not mean the GPU is done?)
The results are similar if I explicitly provide a stream.
stream = cuda.stream()
%timeit cuda.to_device(A, stream=stream)
188 ms ± 4.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n1 cuda.to_device(A, stream=stream)
cuda.synchronize()
82.9 µs ± 6.66 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
"To enqueue the transfer to a stream" makes me think the work of transferring the data is delegated to the stream, in which case simply calling cuda.to_device
should be faster (since it would return immediatly).
%%timeit -n1 cuda.synchronize()
cuda.to_device(A, stream=stream)
186 ms ± 30 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
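As far as I know, a host-to-device copy can only be truly asynchronous when the host buffer is page-locked (pinned); from an ordinary pageable numpy array the driver stages the data and the call effectively blocks, stream or not. A hedged sketch of what an actually asynchronous transfer could look like, using Numba's cuda.pinned_array (same shape as above, timings not measured):

from numba import cuda
import numpy as np

stream = cuda.stream()

A_pinned = cuda.pinned_array((10000, 10000), dtype=np.float64)  # page-locked host buffer
A_pinned[:] = 1.0

d_A = cuda.to_device(A_pinned, stream=stream)  # enqueued on the stream, should return quickly
# ... the host is free to do other work here ...
stream.synchronize()  # wait for the transfer to actually finish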
Edit:
I launched many kernels to confirm that to_device uses the same stream.
import time

start = time.time()
for i in range(1000):
    matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)  # asynchronous launches on the default stream
print(f'launched kernels: {time.time() - start}')

cuda.to_device(A)  # takes a while, since it waits for the queue to free up
print(f'transferred: {time.time() - start}')

cuda.synchronize()
print(f'synchronized: {time.time() - start}')  # almost the same time, meaning cuda.to_device waited for the other kernels to finish
launched kernels: 0.16392898559570312
transferred: 29.859858512878418
synchronized: 29.860819101333618
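The same conclusion can be checked with an event instead of wall-clock timestamps: record an event right after the last kernel launch and query it once the copy returns. A sketch assuming the matrix_multiplication kernel, launch configuration, and device arrays from above; if to_device really shares the default stream with the kernels, the event should report completed by the time the copy returns.

from numba import cuda

done = cuda.event()
for i in range(1000):
    matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
done.record()  # enqueued on the default stream, after all the kernels

cuda.to_device(A)  # on the same stream, so it queues behind the kernels
print('kernels finished before the copy returned:', done.query())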
Edit 2:
There is a staging buffer between the CPU and GPU. to_device returns after sending the data to the staging buffer, and the additional latency comes from the staging buffer sending the data to the GPU. I'm not sure why staged → GPU (82.6 µs − 14.5 µs) is so much faster than CPU → staged (188 ms).