When you perform a wait-on-value operation using the CUDA driver API call cuStreamWaitValue32(), you can specify the flag CU_STREAM_WAIT_VALUE_FLUSH. Here's what the documentation says it does:
Follow the wait operation with a flush of outstanding remote writes. This means that, if a remote write operation is guaranteed to have reached the device before the wait can be satisfied, that write is guaranteed to be visible to downstream device work.
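To make the question concrete, here is a minimal sketch of the kind of call I mean. It is only an illustration under some assumptions: the names `CHECK`, `flag_addr` and `wait_flags` are made up for this example, the write that satisfies the wait is issued on the same stream (so it is not itself a "remote" write), and depending on the driver version stream memory operations may need to be enabled or may report that they are unsupported.

```cuda
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper for this sketch: abort on any driver API error. */
#define CHECK(call)                                                   \
    do {                                                              \
        CUresult res_ = (call);                                       \
        if (res_ != CUDA_SUCCESS) {                                   \
            const char *msg_ = NULL;                                  \
            cuGetErrorString(res_, &msg_);                            \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    msg_ ? msg_ : "unknown error");                   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUstream stream;
    CUdeviceptr flag_addr;   /* 32-bit word the stream will wait on */
    int can_flush = 0;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    /* The flush flag is only valid on devices that report this attribute. */
    CHECK(cuDeviceGetAttribute(&can_flush,
                               CU_DEVICE_ATTRIBUTE_CAN_FLUSH_REMOTE_WRITES,
                               dev));

    CHECK(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));
    CHECK(cuMemAlloc(&flag_addr, sizeof(cuuint32_t)));
    CHECK(cuMemsetD32(flag_addr, 0, 1));

    /* Enqueue a write of 1 so that the wait below can eventually complete. */
    CHECK(cuStreamWriteValue32(stream, flag_addr, 1,
                               CU_STREAM_WRITE_VALUE_DEFAULT));

    /* Wait until *flag_addr == 1; request the flush only if supported. */
    unsigned int wait_flags = CU_STREAM_WAIT_VALUE_EQ;
    if (can_flush) {
        wait_flags |= CU_STREAM_WAIT_VALUE_FLUSH;
    }
    CHECK(cuStreamWaitValue32(stream, flag_addr, 1, wait_flags));

    /* Work enqueued on the stream after this point is "downstream" of the wait. */
    CHECK(cuStreamSynchronize(stream));
    printf("wait completed (flush requested: %s)\n", can_flush ? "yes" : "no");

    CHECK(cuMemFree(flag_addr));
    CHECK(cuStreamDestroy(stream));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

This should build as plain host code linked against the driver library (for example with `-lcuda`). What I am unsure about is which writes arriving at the device before such a wait is satisfied are covered by the flush.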
My question is: What counts as a "remote write" in this context? Is it only writes made by calls to cuStreamWriteValue32() / cuStreamWriteValue64()? Or is it any kind of write involving a different device or the host, including cudaMemcpy() and friends?