In CUDA, how is stream 0 related to other streams? Does stream 0 (default stream) execute concurrently with other streams in a context or not?
Consider the following example:
cudaMemcpy(Dst, Src, sizeof(float)*datasize, cudaMemcpyHostToDevice);//stream 0;
cudaStream_t stream1;
/* ...creating stream1... */
somekernel<<<blocks, threads, 0, stream1>>>(Dst);//stream 1;
In the above code, can the compiler ensure somekernel always launches AFTER cudaMemcpy finishes, or will somekernel execute concurrently with cudaMemcpy?