
I have implemented the scan algorithm in CUDA from scratch and am trying to use it for small amounts of data, less than 80,000 bytes.

I created two separate versions: one launches the kernels in separate streams where possible, and the other runs everything in the default stream.

What I've observed is that, for this range of data sizes, the version with streams takes longer to complete the task than the version without.

When analysed with nvprof, it appears that for smaller data sizes, running in streams does not provide any parallelism across the separate kernels:

[Figure: nvprof timeline, scan without streams]

[Figure: nvprof timeline, scan with streams]

But when the data size is increased, some overlap between the kernels can be obtained:

[Figure: nvprof timeline, scan with streams, 400,000 bytes]

My question is: are there any additional parameters that can reduce these kernel-launch delays, or is it normal to see this kind of behavior at small data sizes, where using streams is disadvantageous?
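For reference, the comparison can be sketched with a minimal self-contained example. A trivial dummy kernel stands in for the actual scan kernels here; the kernel name, sizes, and buffers are illustrative, not the real scan code:

```cuda
#include <cuda_runtime.h>

// Dummy kernel standing in for one scan pass.
__global__ void dummyScanPass(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 80000 / sizeof(float);   // ~80,000 bytes of floats
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);

    // Version 1: both passes issued to the default stream (implicitly ordered).
    dummyScanPass<<<grid, block>>>(d_a, n);
    dummyScanPass<<<grid, block>>>(d_b, n);
    cudaDeviceSynchronize();

    // Version 2: independent passes issued into separate streams,
    // which in principle allows them to run concurrently.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    dummyScanPass<<<grid, block, 0, s1>>>(d_a, n);
    dummyScanPass<<<grid, block, 0, s2>>>(d_b, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```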

UPDATE:

I've included the Runtime API call timeline as well, to clarify the answer:

[Figure: nvprof timeline with streams, including Runtime API calls]


1 Answer


Generally, your data is too small to fully utilize the GPU in your first case. If you check the 'Runtime API' timeline in nvvp, which is not shown in your figures, you will find that launching a kernel takes a few microseconds. If your first kernel in stream 13 is too short, the second kernel in stream 14 may not have been launched yet by the time the first one finishes, so there is no parallelism across streams.
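You can see this launch overhead directly by timing the host-side launch calls. The following is a rough sketch with a made-up short kernel; the exact numbers depend on your driver, hardware, and launch configuration:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately short kernel, similar to a scan pass on small data.
__global__ void shortKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Warm up so the first launch's one-time context cost is excluded.
    shortKernel<<<4, 256, 0, s1>>>(d_x, n);
    cudaDeviceSynchronize();

    // Time the host-side cost of issuing two launches into two streams.
    auto t0 = std::chrono::high_resolution_clock::now();
    shortKernel<<<4, 256, 0, s1>>>(d_x, n);
    auto t1 = std::chrono::high_resolution_clock::now();
    shortKernel<<<4, 256, 0, s2>>>(d_x, n);
    auto t2 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();

    printf("launch 1: %lld us, launch 2: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
    // If the first kernel's execution is shorter than the time it takes the
    // host to issue the second launch, the stream-14 kernel cannot overlap
    // with the stream-13 kernel, regardless of how the streams are set up.

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_x);
    return 0;
}
```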

Because of these overheads, you may find it is even faster to run your program on the CPU when the data is this small.

kangshiyin
  • That's what I suspected as well. Still, from the timeline I can see that the cudaLaunch API call for the second kernel ends before the execution of the first kernel ends, so technically the second kernel could start before the first one finishes. I was hoping there might be a way to reduce the cudaLaunch-to-actual-kernel-execution time, but I think, as you mentioned, this launch-to-execution time is considerably high at low data sizes. And you are correct that the CPU outperforms the GPU at low data sizes. Thanks – BAdhi Jun 09 '16 at 04:38