Can somebody explain the data independence requirement for concurrent CUDA streams? Assume I want to run a kernel with the following signature in 8 concurrent streams:

__global__ void Kernel(const float *readOnlyInput, float *output);
Can all streams read the same readOnlyInput array while writing to different output arrays? Or, to achieve concurrency, do they need to read from different memory locations as well? In other words, will the snippet below execute concurrently, or does each launch also need its own input slice (readOnlyInput + i*size) to ensure concurrency?
cudaStream_t stream[8];
int size = 1000;//some array size
int blocks =2, threads=256;//some grid dims
for (int i = 0; i < 8; ++i){
    cudaStreamCreate(&stream[i]);
}
for (int i = 0; i < 8; ++i){
    // stream is the 4th launch parameter; the 3rd is dynamic shared memory (0 here)
    Kernel<<<blocks, threads, 0, stream[i]>>>(readOnlyInput, output + i*size);
}