
CUDA offers three ways of specifying kernel arguments.

  1. By passing an array of N pointers, one per argument, through the `kernelParams` parameter of cuLaunchKernel().
  2. By packing the N arguments into a single buffer passed through the `extra` parameter of cuLaunchKernel().
  3. By using a series of cudaSetupArgument() calls followed by cudaLaunch(), but I think this way is deprecated.

From a strict performance point of view, I'm wondering whether one approach is better than the others. Does anyone know if:

  • Option 1 will result in N GPU accesses, whereas option 2 will only result in one?
  • If that is true for option 1, will CUDA re-access the GPU to set a parameter even if its value has not changed across several kernel calls?

The real issue behind these questions is that I have a rather "simple" kernel with a huge number of arguments, called many times with (almost) the same argument values, and I was wondering whether argument passing alone could have a measurable impact on performance.

Answers here do not fully answer my questions.

EDIT: Also, does anyone know whether nvprof measures just kernel time, or argument passing plus kernel time?

GaTTaCa
  • As the accepted answer says `kernel launch overhead is only of the order of 10-20 microseconds, so there probably isn't a lot of scope to improve`, I don't see the point of this question. If your application is slow, the reason is probably hidden somewhere else. – KiaMorot Sep 12 '14 at 08:43
  • Yes, but that is too vague. I was wondering whether this time (as small as it may be) is proportional to the number of arguments given to the kernel. I never complained about my application being slow. – GaTTaCa Sep 12 '14 at 14:40
  • I would not expect so. – KiaMorot Sep 12 '14 at 14:44

1 Answer


The driver overhead for a kernel launch is high enough that the method of parameter passing has negligible impact.

If your kernel's parameter set is large enough, and changes little enough, you may get some mileage from manually copying the parameters into device memory (or constant memory if you're feeling fancy).
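The constant-memory variant the answer hints at might look something like the sketch below (CUDA runtime API; the struct layout, kernel, and all names are hypothetical, not from the question). The idea is to collapse the many arguments into one `__constant__` struct and upload it only when the values actually change, so repeated launches pass just a single pointer:

```cuda
// Hypothetical sketch: a large, rarely-changing argument set kept in
// __constant__ memory instead of being passed on every launch.
struct Params {
    float alpha, beta;   // ... imagine dozens more fields here ...
    int   n;
};

__constant__ Params d_params;   // lives in constant memory, cached on-chip

__global__ void kernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < d_params.n)
        out[i] = d_params.alpha * i + d_params.beta;
}

// Host side: upload the struct only when its contents have changed.
void launch(float *d_out, const Params *h_params, bool params_changed)
{
    if (params_changed)
        cudaMemcpyToSymbol(d_params, h_params, sizeof(Params));
    kernel<<<(h_params->n + 255) / 256, 256>>>(d_out);
}
```

One caveat, echoed in the comments below: a single `__constant__` symbol is shared by all in-flight launches of the kernel, so this only works if concurrent launches don't need different parameter values.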

ArchaeaSoftware
  • Won't reading arguments from memory be far more expensive than reading them from registers? – GaTTaCa Sep 12 '14 at 14:45
  • Actually, the NVIDIA best practices guide states that "For all threads of a half warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address", so it should be OK. – GaTTaCa Sep 12 '14 at 14:50
  • Don't believe everything NVIDIA says about memory performance :-), but for parameter passing it would work fine unless you are trying to run multiple instances of the kernel concurrently or something. In any case, the parameters you specify to cuLaunchKernel() get read and copied by the CPU (running the driver) into the command buffer that the GPU reads to launch the kernel. So the operation is far removed from GPU registers, until it's time for the kernel to read them. – ArchaeaSoftware Sep 13 '14 at 20:32