The CUDA (driver API) documentation says:
The start of execution of a callback has the same effect as synchronizing an event recorded in the same stream immediately prior to the callback. It thus synchronizes streams which have been "joined" prior to the callback.
Does this mean that if I have a pipeline with a callback after each critical point to signal the host, I don't need a cuStreamSynchronize at those points before accessing the output arrays?
Very simple code, like this:
cuda memcpy host to device
cuda launch kernel
cuda memcpy device to host
add callback

callback()
{
    // Is it safe to access the host "results" array here?
    // (Assuming no more CUDA commands are issued on these arrays.)
}