CUDA C Programming Guide provides the following statements:
For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:
‣ Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
I am quite lost here. What is a dependency check? Can I say a kernel execution on some device memories requires a dependency check on all the previous kernel or memory transfer involving the same device memory? If this is true (maybe not true), this dependency check blocks all later kernels from any other stream according to the above statement, and therefore no asynchronous or concurrent execution will happen afterward, which seems not true.
Any explanation or elaboration will be appreciated!