When running concurrent copy & kernel operations:
If I have a kernel runTime that is twice as long as a dataCopy operation, will I get 2 copies per kernel run?
The stream examples I'm seeing show a 1:1 relationship. (Time of copy = time of kernel run.) I'm wondering what happens when there is something different. Is there always one copy operation (max) for every kernel launch? Or does the copy operation run independent of the kernel launch? i.e. I could possibly complete 5 copy operations for every kernel launch, if the run & copy time work out that way.
(I'm trying to figure out how many copy operations to queue up before a kernel launch.)
One to one: (time to copy = kernel run time)
<--stream1Copy--><--stream2Copy-->
..............................<-stream1Kernel->
Two to one: (time to copy = 1/2 kernel run time)
<-stream1Copy-><-stream2Copy-><-stream3Copy->
............................<----------stream1Kernel------------>