
I am trying to understand CUDA streams and I have written my first program with streams, but it is slower than the usual kernel invocation...

Why is this code slower:

cudaMemcpyAsync(pole_dev, pole, size, cudaMemcpyHostToDevice, stream_1);    
addKernel<<<count/100, 100, 0, stream_1>>>(pole_dev);
cudaMemcpyAsync(pole, pole_dev, size, cudaMemcpyDeviceToHost, stream_1);
cudaThreadSynchronize();  // I don't know the difference between cudaThreadSynchronize and cudaDeviceSynchronize
cudaDeviceSynchronize();  // they act pretty much the same...

than this one:

cudaMemcpy(pole_dev, pole, size, cudaMemcpyHostToDevice);
addKernel<<<count/100, 100>>>(pole_dev);
cudaMemcpy(pole, pole_dev, size, cudaMemcpyDeviceToHost);

I thought that it should run faster ... the value of the variable count is 6 500 000 (its maximum) ... the first piece of code takes 14 milliseconds and the second takes 11 milliseconds.

Can anybody explain it to me, please?

einpoklum
Snurka Bill

1 Answer


In this snippet you are dealing with only a single stream (stream_1), but that's effectively what CUDA does for you automatically when you don't explicitly manipulate streams.

To take advantage of streams and asynchronous memory transfers, you need to use several streams, splitting your data and computations across them.
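A minimal sketch of that pattern, reusing addKernel, pole, pole_dev and count from the question: the data is cut into one chunk per stream, so chunk i+1's host-to-device copy can overlap with chunk i's kernel. This assumes pole was allocated with cudaHostAlloc (pinned memory; async copies from pageable memory fall back to synchronous behaviour) and that count divides evenly; the stream count of 4 is illustrative.

```cuda
const int NSTREAMS = 4;
const int chunk = count / NSTREAMS;   // assumes count % NSTREAMS == 0
cudaStream_t streams[NSTREAMS];

for (int i = 0; i < NSTREAMS; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < NSTREAMS; ++i) {
    int offset = i * chunk;
    // copy, compute and copy back only this stream's slice of the array
    cudaMemcpyAsync(pole_dev + offset, pole + offset,
                    chunk * sizeof(*pole), cudaMemcpyHostToDevice, streams[i]);
    addKernel<<<chunk / 100, 100, 0, streams[i]>>>(pole_dev + offset);
    cudaMemcpyAsync(pole + offset, pole_dev + offset,
                    chunk * sizeof(*pole), cudaMemcpyDeviceToHost, streams[i]);
}

cudaDeviceSynchronize();   // wait for all streams to drain

for (int i = 0; i < NSTREAMS; ++i)
    cudaStreamDestroy(streams[i]);
```

Within each stream the three operations still run in order, so per-chunk dependencies are respected; the overlap happens between different streams, and only on devices with a copy engine (check the deviceOverlap / asyncEngineCount device property).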

jopasserat
  • so you mean, that in first memcpy function I should use stream_1, in kernel use stream_2 and in second memcpy function use stream_3 ? – Snurka Bill Dec 09 '12 at 10:11
  • 1
    @snukrabill: streams can't speed up this operation. The is no scope for overlap between copy and kernel execution, you first copy must complete before the kernel can run, and your second copy can't start until the kernel has finished. – talonmies Dec 09 '12 at 10:18
  • I thought that streams could divide a big array into small pieces, and the kernel could be started on those pieces ... as you said, right now I need to wait until the memcpy is finished, but I want the memcpy and the kernel running at the same time ... so this isn't the right way? – Snurka Bill Dec 09 '12 at 10:22
  • 1
    You need a first synchronous mempcy of the first part of your data, then launch a first kernel corresponding to this data. Kernel calls are asynchronous, that allows you to start a second memcpy which will be asynchronous on a second chunk of data, whilst the first kernel is still running. Then the corresponding kernel can be started, and so on and so forth... – jopasserat Dec 09 '12 at 10:34
  • 2
    @snukrabill: No streams are simply a series of execution queues for the GPU and driver. If you want overlapping transfers and kernel execution, you need to design and implement the design pattern which will make it happen. Streams don't magically do that for you. – talonmies Dec 09 '12 at 10:37