
I would like to copy memory between two CUDA devices (with UVA support) by calling cudaMemcpy. Is the call synchronous with respect to the host? I'm aware that cudaMemcpy within the same device is asynchronous, but what about a copy between different devices? Do I need to call cudaDeviceSynchronize to make sure that the copy has finished, or is that ensured automatically?

I also have a similar question about cuBLAS. I'd like to add a vector stored on one device to a vector stored on another, so I'm calling cublasSaxpy for that. Will it block the host until the operation is finished, or do I need to synchronize explicitly?

simon

1 Answer


> I'm aware that cudaMemcpy within the same device is asynchronous

The documentation says, "This function exhibits synchronous behavior for most use cases." (my emphasis). Although cudaMemcpy() can behave asynchronously in some corner cases, those same cases have other properties that cancel out the asynchrony. The end result is that you can rely on cudaMemcpy() being synchronous, including for peer-to-peer copies.

If you need asynchronous behavior, you should call cudaMemcpyAsync().
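
As a rough illustration of the asynchronous route, here is a minimal sketch of a device-to-device copy issued with cudaMemcpyAsync() and followed by an explicit wait. It assumes at least two UVA-capable devices with ordinals 0 and 1; the buffer size, the variable names, and the CHECK macro are illustrative and not part of the original question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

int main() {
    int deviceCount = 0;
    CHECK(cudaGetDeviceCount(&deviceCount));
    if (deviceCount < 2) {
        std::printf("This sketch needs two CUDA devices; found %d\n", deviceCount);
        return 0;
    }

    const size_t bytes = 1 << 20;  // 1 MiB, an arbitrary size
    float *src = nullptr;
    float *dst = nullptr;

    // Source buffer on device 0 (assumed ordinal).
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(reinterpret_cast<void **>(&src), bytes));
    CHECK(cudaMemset(src, 0, bytes));
    CHECK(cudaDeviceSynchronize());  // make sure the source is initialized

    // Destination buffer and a stream on device 1 (assumed ordinal).
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(reinterpret_cast<void **>(&dst), bytes));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // With UVA, cudaMemcpyDefault lets the runtime infer where each
    // pointer lives. The call may return before the copy has finished.
    CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDefault, stream));

    // ... independent host work could go here ...

    // Block the host until everything queued on the stream is done.
    CHECK(cudaStreamSynchronize(stream));
    std::printf("copy complete\n");

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    return 0;
}
```

Waiting on the stream rather than calling cudaDeviceSynchronize() keeps the wait limited to the work queued on that one stream, which matters once other work is in flight on the same device.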

The CUBLAS API is asynchronous for the most part, including cublasSaxpy. The exception is some of the calls that return scalars.

Roger Dahl
  • Actually, [NVIDIA documentation](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device) says `Memory copies between two addresses to the same device memory;` is asynchronous. But in my case I'm doing a peer-to-peer copy, so I assume it's synchronous. `The CUBLAS API is asynchronous for the most part, including cublasSaxpy` Even when one of the arguments is stored on another device? – simon Mar 15 '14 at 18:41
  • @simon, thank you for the correction. Odd that the programming guide has more information about this than the reference. That certainly sounds like it could trip someone up. About `cublasSaxpy` with arguments stored on another device, I don't know. Will that actually work? If it does, you could find out by timing your calls. Just compare the timing of `cublasSaxpy` alone and one together with a `cudaDeviceSynchronize()`. – Roger Dahl Mar 16 '14 at 00:32
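
The timing comparison suggested in the last comment could be sketched roughly as follows. This version keeps both vectors on a single device, because whether `cublasSaxpy` accepts an argument that lives on another device is left open above; the vector length, the variable names, and the use of `std::chrono` for host-side timing are assumptions made for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Illustrative helper: stop on the first failure with a short message.
static void require(bool ok, const char *what) {
    if (!ok) {
        std::fprintf(stderr, "failed: %s\n", what);
        std::exit(EXIT_FAILURE);
    }
}

int main() {
    const int n = 1 << 24;      // vector length, arbitrary
    const float alpha = 2.0f;   // scalar passed from host memory
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *x = nullptr;
    float *y = nullptr;
    require(cudaMalloc(reinterpret_cast<void **>(&x), bytes) == cudaSuccess, "cudaMalloc x");
    require(cudaMalloc(reinterpret_cast<void **>(&y), bytes) == cudaSuccess, "cudaMalloc y");
    require(cudaMemset(x, 0, bytes) == cudaSuccess, "cudaMemset x");
    require(cudaMemset(y, 0, bytes) == cudaSuccess, "cudaMemset y");

    cublasHandle_t handle;
    require(cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS, "cublasCreate");

    // Warm-up call so one-time initialization is not part of the measurement.
    require(cublasSaxpy(handle, n, &alpha, x, 1, y, 1) == CUBLAS_STATUS_SUCCESS, "warm-up saxpy");
    require(cudaDeviceSynchronize() == cudaSuccess, "warm-up sync");

    using clock = std::chrono::steady_clock;
    using us = std::chrono::duration<double, std::micro>;

    const auto t0 = clock::now();
    require(cublasSaxpy(handle, n, &alpha, x, 1, y, 1) == CUBLAS_STATUS_SUCCESS, "timed saxpy");
    const auto t1 = clock::now();  // the call has returned to the host
    require(cudaDeviceSynchronize() == cudaSuccess, "timed sync");
    const auto t2 = clock::now();  // the device work has finished

    std::printf("call returned after:       %.1f us\n", us(t1 - t0).count());
    std::printf("work finished after sync:  %.1f us\n", us(t2 - t0).count());

    cublasDestroy(handle);
    cudaFree(y);
    cudaFree(x);
    return 0;
}
```

If the first number is much smaller than the second, the call returned before the work completed, which is what an asynchronous call would look like. Building this needs the cuBLAS library on the link line, for example `nvcc saxpy_timing.cu -lcublas` (the file name is hypothetical).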