Understanding CUDA dependency check

Question

CUDA C Programming Guide provides the following statements:

For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:

‣ Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;

‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.

I am quite lost here. What is a dependency check? Can I say that a kernel operating on some device memory requires a dependency check on all previous kernels or memory transfers involving the same device memory? If that is true (it may not be), then according to the statement above this dependency check blocks all later kernels from any other stream, so no asynchronous or concurrent execution would happen afterward, which does not seem right.

Any explanation or elaboration will be appreciated!

score 4 · Accepted Answer · edited Jun 18 '18 at 10:41

First of all, I suggest you visit NVIDIA's webinar site and watch the webinar on Concurrency & Streams.

Furthermore, consider the following points:

‣ Commands issued to the same stream are treated as dependent.
  E.g. you would insert a kernel into a stream after a memcopy of some data this kernel will access. The kernel "depends" on the data being available.
‣ Commands in the same stream are therefore guaranteed to execute sequentially (or synchronously, which is often used as a synonym).
‣ Commands in different streams, however, are independent and can run concurrently.
‣ So dependencies are known only to the programmer and are expressed using streams (to avoid errors)!
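To make the same-stream ordering concrete, here is a minimal sketch. The kernel `scale` and the function `run_in_order_stream` are hypothetical names chosen for illustration; only the CUDA runtime calls are real API. It queues a copy, a kernel, and a copy back in one stream, relying on the stream to run them in issue order.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Queues copy -> kernel -> copy in ONE stream and checks the result.
// Returns true if every element came back doubled.
bool run_in_order_stream() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = nullptr, *d = nullptr;
    cudaStream_t s = nullptr;

    // Pinned host memory so the async copies can be truly asynchronous.
    bool ok = cudaMallocHost((void **)&h, bytes) == cudaSuccess;
    ok = ok && cudaMalloc((void **)&d, bytes) == cudaSuccess;
    ok = ok && cudaStreamCreate(&s) == cudaSuccess;

    if (ok) {
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        // All three commands go into stream s, so they run in issue order:
        // the kernel sees the copied data, and the copy back sees the result.
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
        scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);

        // Wait for everything queued in s before reading h on the host.
        ok = cudaStreamSynchronize(s) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f);
    }

    if (s) cudaStreamDestroy(s);
    if (d) cudaFree(d);
    if (h) cudaFreeHost(h);
    return ok;
}
```

No explicit synchronization is needed between the three commands; putting them in the same stream is what expresses the dependency.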

The following applies only to devices with compute capability 3.0 or lower (as stated in the guide). If you want to know more about the changes to stream scheduling behaviour with compute capability 3.5, have a look at HyperQ and the corresponding example. At this point I also want to reference this thread, where I found the HyperQ examples :)
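If you are unsure whether the quoted restriction concerns your device, a sketch like the following can check it. The helper names `is_cc_3_0_or_lower` and `quoted_rule_applies` are made up for this example; `cudaGetDeviceProperties` and the `major`, `minor`, and `concurrentKernels` fields of `cudaDeviceProp` are part of the CUDA runtime API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: true for compute capability 3.0 or lower.
bool is_cc_3_0_or_lower(int major, int minor) {
    return major < 3 || (major == 3 && minor == 0);
}

// Hypothetical helper: true if the device supports concurrent kernel
// execution AND has compute capability 3.0 or lower, i.e. the case the
// quoted passage of the guide talks about.
bool quoted_rule_applies(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    printf("%s: compute capability %d.%d, concurrentKernels=%d\n",
           prop.name, prop.major, prop.minor, prop.concurrentKernels);
    return prop.concurrentKernels != 0 &&
           is_cc_3_0_or_lower(prop.major, prop.minor);
}
```

Calling `quoted_rule_applies(0)` checks the first device; on a compute capability 3.5 or newer GPU it should report that the restriction does not apply.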

About your second question: I do not quite understand what you mean by a "kernel execution on some device memory" or a "kernel execution involving device memory", so I reduced your statement to:

A kernel execution requires a dependency check on all the previous kernels and memory transfers.

Better would be:

A CUDA operation requires a dependency check to see whether preceding CUDA operations in the same stream have completed.

I think your problem here is with the expression "started executing". It means that independent kernel launches (that is, launches in different streams) can still run concurrently with the previous kernels, provided all thread blocks of those previous kernels have started executing and enough device resources are available.
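As a rough illustration of independent launches, the sketch below issues one kernel into each of two streams. The names `add_value` and `run_two_streams` are hypothetical. The code only permits overlap; whether the kernels actually run at the same time depends on the device and on available resources, so treat it as a sketch rather than a guarantee of concurrency.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel for illustration: adds `value` to every element.
__global__ void add_value(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += value;
}

// Launches one kernel per stream on separate buffers and checks both results.
bool run_two_streams() {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(int);
    int *d[2] = {nullptr, nullptr};
    cudaStream_t s[2] = {nullptr, nullptr};
    bool ok = true;

    for (int k = 0; k < 2; ++k) {
        ok = ok && cudaStreamCreate(&s[k]) == cudaSuccess;
        ok = ok && cudaMalloc((void **)&d[k], bytes) == cudaSuccess;
    }

    if (ok) {
        for (int k = 0; k < 2; ++k) {
            // Within stream s[k]: memset, then kernel, in that order.
            // Across s[0] and s[1]: no ordering, so the work MAY overlap.
            cudaMemsetAsync(d[k], 0, bytes, s[k]);
            add_value<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n, k + 1);
        }
        // Wait for both streams before reading results on the host.
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }

    std::vector<int> h(n);
    for (int k = 0; ok && k < 2; ++k) {
        ok = cudaMemcpy(h.data(), d[k], bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (h[i] == k + 1);
    }

    for (int k = 0; k < 2; ++k) {
        if (d[k]) cudaFree(d[k]);
        if (s[k]) cudaStreamDestroy(s[k]);
    }
    return ok;
}
```

Because the two kernels touch different buffers and sit in different streams, nothing in the program makes one wait for the other; a profiler timeline is the usual way to see whether they actually overlapped.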
