
Can somebody explain the data independence requirement for concurrent CUDA streams? Assume I want to run the following kernel in 8 concurrent streams:

Kernel<<<blocks, threads>>>(float *readOnlyInput, float *output);

Can all streams read the same *readOnlyInput and write to different *output arrays?

Or, in order to achieve concurrency, do they need to read from different memory locations as well?

Will the above pseudocode snippet execute concurrently, or does each stream need to read from *readOnlyInput + i*size to ensure concurrency?

cudaStream_t stream[8];
int size = 1000;                 // some array size
int blocks = 2, threads = 256;   // some grid dims

for (int i = 0; i < 8; ++i) {
    cudaStreamCreate(&stream[i]);
}
for (int i = 0; i < 8; ++i) {
    // the stream is the fourth launch parameter (the third is the
    // dynamic shared memory size, which defaults to 0)
    Kernel<<<blocks, threads, 0, stream[i]>>>(readOnlyInput, output + i * size);
}
einpoklum

1 Answer


You can safely read the same data from multiple independent kernels in different streams, as long as there is sufficient synchronization in place: the data must be written entirely before any of the kernels starts, and must not be overwritten before all of the kernels have finished.
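As a hedged sketch of that pattern (kernel body, host buffer, and sizes are placeholders assumed for illustration, not taken from the original post): the shared input is written once before any launch, all streams then read it concurrently while each writes a disjoint slice of the output, and only a device-wide synchronization after the launches makes it safe to free or overwrite the input.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel: each stream reads the same input and writes
// its own slice of the output array.
__global__ void Kernel(const float *readOnlyInput, float *output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        output[idx] = readOnlyInput[idx] * 2.0f;   // assumed computation
}

int main() {
    const int nStreams = 8, size = 1000;
    float hostInput[size] = {0};                   // filled elsewhere
    float *readOnlyInput, *allOutputs;
    cudaMalloc(&readOnlyInput, size * sizeof(float));
    cudaMalloc(&allOutputs, nStreams * size * sizeof(float));

    // Input is written entirely before any kernel launches...
    cudaMemcpy(readOnlyInput, hostInput, size * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&stream[i]);

    // ...so all streams may read it concurrently; each kernel writes
    // a disjoint slice of allOutputs, hence no write conflicts.
    for (int i = 0; i < nStreams; ++i)
        Kernel<<<2, 256, 0, stream[i]>>>(readOnlyInput,
                                         allOutputs + i * size, size);

    // Only after all kernels have finished is it safe to free or
    // overwrite the shared input.
    cudaDeviceSynchronize();
    cudaFree(readOnlyInput);
    // ...copy back allOutputs here, then:
    cudaFree(allOutputs);
    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(stream[i]);
    return 0;
}
```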

tera
  • What happens if I want to run them concurrently, without stream synchronization? – thanasisanthopoulos Oct 04 '12 at 19:54
  • Then the data will get changed while the kernels read it, which is not a good idea. You can run the kernels in parallel, but the data must be in place and remain unchanged for the whole duration of all kernels. – tera Oct 05 '12 at 10:20
  • Yes, but they read the same read-only array and manipulate different portions of an output array. There is no write conflict; they just read a read-only array from the same memory location. How can the read-only data get changed while the kernels read it? – thanasisanthopoulos Oct 05 '12 at 11:36
  • Yes, I understand there is no change to the input array within the code you showed. But I'm trying to point out that the input array has to be created at some time and will not live on eternally. So a more complete (but not syntactically correct) code snippet may be structured like this: `cudaMalloc(readOnlyInput)`, `cudaMalloc(allOutputs)`, `cudaMemcpy(readOnlyInput)`, `cudaDeviceSynchronize()`, run kernels, `cudaDeviceSynchronize()`, `cudaFree(readOnlyInput)`, `cudaMemcpy(allOutputs)`, `cudaFree(allOutputs)`. – tera Oct 05 '12 at 15:06