I have a single kernel which is feeling data to two parameters (dev_out_1 and dev_out_2) using single stream. I wanted to copy back the data from the device to host in parallel. my requirement is to use single stream and copy back to the host in parallel.
How do you manage this kind of issues ?
SomeCudaCall<<<25,34>>>(input, dev_out_1,dev_out_2);
cudaMemcpyAsync(toHere_1, dev_out_1, sizeof(int), cudaMemcpyDeviceToHost,0);
cudaMemcpyAsync(toHere_2, dev_out_2, sizeof(int), cudaMemcpyDeviceToHost,0);