I would like to copy memory between two CUDA devices (with UVA support) by calling cudaMemcpy. Is that call synchronous with respect to the host? I'm aware that a device-to-device cudaMemcpy within the same device is asynchronous with respect to the host, but what about a copy between different devices? Do I need to call cudaDeviceSynchronize to make sure the copy has finished, or is that ensured automatically?
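To make the question concrete, here is a minimal sketch of the copy I have in mind. The device IDs 0 and 1, the buffer names, the size, and the CHECK macro are placeholders for illustration, and the cudaDeviceSynchronize calls at the end are exactly the part I'm unsure about.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);  // placeholder size
    float *src = nullptr;  // buffer on device 0
    float *dst = nullptr;  // buffer on device 1

    // Allocate and zero the source buffer on device 0.
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMemset(src, 0, bytes));

    // Allocate the destination buffer on device 1.
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, bytes));

    // With UVA, cudaMemcpyDefault lets the runtime infer from the pointer
    // values which device owns each buffer.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDefault));

    // The open question: are these two synchronizations required before
    // the host may assume that dst holds the copied data?
    CHECK(cudaSetDevice(0));
    CHECK(cudaDeviceSynchronize());
    CHECK(cudaSetDevice(1));
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    return 0;
}
```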
I also have a similar question about cuBLAS. I'd like to add a vector stored on one device to a vector stored on another, so I'm calling cublasSaxpy for that. Will it block the host until the operation has finished, or do I need to synchronize explicitly?