CUDA: Does passing arguments to a kernel slow the kernel launch much?

Question

CUDA beginner here.

In my code i am currently launching kernels a lot of times in a loop in the host code. (Because i need synchronization between blocks). So i wondered if i might be able to optimize the kernel launch.

My kernel launches look something like this:

MyKernel<<<blocks,threadsperblock>>>(double_ptr, double_ptr, int N, double x);

So to launch a kernel some signal obviously has to go from the CPU to the GPU, but i'm wondering if the passing of arguments make this process noticeably slower.

The arguments to the kernel are the same every single time, so perhaps i could save time by copying them once, access them in the kernel by a name defined by

__device__ int N;
<and somehow (how?) copy the value to this name N on the GPU once>

and simply launch the kernel with no arguments as such

MyKernel<<<blocks,threadsperblock>>>();

Will this make my program any faster? What is the best way of doing this? AFAIK the arguments are stored in some constant global memory. How can i make sure that the manually transferred values are stored in a memory which is as fast or faster?

Thanks in advance for any help.

score 5 · Accepted Answer · answered Jun 28 '11 at 13:52

I would expect the benefits of such an optimization to be rather small. On sane platforms (ie. anything other than WDDM), kernel launch overhead is only of the order of 10-20 microseconds, so there probably isn't a lot of scope to improve.

Having said that, if you want to try, the logical way to affect this is using constant memory. Define each argument as a __constant__ symbol at translation unit scope, then use the cudaMemcpyToSymbol function to copy values from the host to device constant memory.

score 3 · Answer 2 · answered Jun 28 '11 at 13:53

3

Simple answer: no.

To be more elaborate: You need to send some signals from host to the GPU anyway, to launch the kernel itself. At this point, few more bytes of parameter data does not matter anymore.

answered Jun 28 '11 at 13:53

CygnusX1

20,968
5
65
109

In an extreme case, when using very "tight" kernels, it might actually matter, but of course this is only true if there is no other higher latency operation done in the meanwhile. – pszilard Jun 29 '11 at 01:35
1

Even with very tight kernels. Sending 1 byte or 10KB to GPU does not matter. It will take the same amount of time, because of latency and not throughput. When you start sending megabytes of data, then the throughput may come into play... – CygnusX1 Jun 29 '11 at 07:50

CUDA: Does passing arguments to a kernel slow the kernel launch much?

2 Answers2

Linked