Does cuda.to_device use the same stream as kernel launches?
It seems that memcpy is synchronous (with respect to the host).
from numba import cuda
import numpy as np
A = np.ones((10000, 10000))
%timeit cuda.to_device(A)
188 ms ± 5.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit cuda.synchronize()
14.5 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n1 cuda.to_device(A)
cuda.synchronize()
82.6 µs ± 11.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
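One way to separate "the call returned on the host" from "the GPU finished the copy" is to time the transfer with CUDA events, which Numba exposes as cuda.event. A minimal sketch, assuming the same A and a CUDA-capable device (the variable names are mine):

from numba import cuda
import numpy as np

A = np.ones((10000, 10000))

start, end = cuda.event(timing=True), cuda.event(timing=True)
start.record()           # marker enqueued on the default stream, before the copy
d_A = cuda.to_device(A)  # the transfer under test
end.record()             # marker enqueued after the copy
end.synchronize()        # block the host until the end marker has executed on the GPU
print(cuda.event_elapsed_time(start, end), 'ms spent on the device')

The elapsed time reported here is measured on the device, independently of when to_device returns on the host.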
If cuda.to_device is synchronous, why would the synchronization take longer than 14.5 µs? (Or does "synchronous with respect to the host" not mean the GPU is done?)
The results are similar if I explicitly provide a stream.
stream = cuda.stream()
%timeit cuda.to_device(A, stream=stream)
188 ms ± 4.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n1 cuda.to_device(A, stream=stream)
cuda.synchronize()
82.9 µs ± 6.66 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
"To enqueue the transfer to a stream" makes me think the work of transferring the data is delegated to the stream, in which case simply calling cuda.to_device
should be faster (since it would return immediatly).
%%timeit -n1 cuda.synchronize()
cuda.to_device(A, stream=stream)
186 ms ± 30 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
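As far as I know, a host-to-device copy can only be truly asynchronous when the host buffer is page-locked (pinned); from an ordinary pageable numpy array the driver stages the data and the call effectively blocks, stream or not. A hedged sketch of what an actually asynchronous transfer could look like, using Numba's cuda.pinned_array (same shape as above, timings not measured):

from numba import cuda
import numpy as np

stream = cuda.stream()

A_pinned = cuda.pinned_array((10000, 10000), dtype=np.float64)  # page-locked host buffer
A_pinned[:] = 1.0

d_A = cuda.to_device(A_pinned, stream=stream)  # enqueued on the stream, should return quickly
# ... the host is free to do other work here ...
stream.synchronize()  # wait for the transfer to actually finish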
Edit:
I launched many kernels to confirm that to_device uses the same stream.
import time

start = time.time()
for i in range(1000):
    matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)  # asynchronous launches on the default stream
print(f'launched kernels: {time.time() - start}')

cuda.to_device(A)  # takes a while, since it waits for the queue to free up
print(f'transferred: {time.time() - start}')

cuda.synchronize()
print(f'synchronized: {time.time() - start}')  # almost the same time, meaning cuda.to_device waited for the other kernels to finish
launched kernels: 0.16392898559570312
transferred: 29.859858512878418
synchronized: 29.860819101333618
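The same conclusion can be checked with an event instead of wall-clock timestamps: record an event right after the last kernel launch and query it once the copy returns. A sketch assuming the matrix_multiplication kernel, launch configuration, and device arrays from above; if to_device really shares the default stream with the kernels, the event should report completed by the time the copy returns.

from numba import cuda

done = cuda.event()
for i in range(1000):
    matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
done.record()  # enqueued on the default stream, after all the kernels

cuda.to_device(A)  # on the same stream, so it queues behind the kernels
print('kernels finished before the copy returned:', done.query())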
Edit 2:
There is a staging buffer between the CPU and GPU. to_device returns after sending the data to the staging buffer, and the additional latency comes from the staging buffer sending the data to the GPU. I'm not sure why staged → GPU (82.6 µs − 14.5 µs) is so much faster than CPU → staged (188 ms).