CUDA offers three ways to pass kernel arguments:
- By giving cuLaunchKernel() an array of N pointers, one per argument (the `kernelParams` parameter).
- By giving cuLaunchKernel() a single buffer in which the N arguments have been packed (the `extra` parameter).
- By using a sequence of cudaSetupArgument() calls followed by cudaLaunch(), but I believe this path is deprecated.
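For reference, here is a minimal sketch of options 1 and 2 with the driver API, assuming a hypothetical kernel with signature `__global__ void kern(int n, float *data)` (the kernel name, argument layout, and launch dimensions are made up for illustration):

```cuda
#include <cuda.h>
#include <string.h>

// Hypothetical kernel: __global__ void kern(int n, float *data)
void launch_both_ways(CUfunction kern, int n, CUdeviceptr data)
{
    // Option 1: kernelParams — an array of pointers, one per argument.
    void *params[] = { &n, &data };
    cuLaunchKernel(kern, 1, 1, 1, 256, 1, 1,
                   0, NULL, params, NULL);

    // Option 2: extra — a single buffer holding the packed arguments.
    // Each argument must sit at an offset aligned to its own size.
    char buf[16];
    memcpy(buf + 0, &n, sizeof(n));       // int at offset 0
    memcpy(buf + 8, &data, sizeof(data)); // 8-byte CUdeviceptr at offset 8
    size_t size = sizeof(buf);
    void *extra[] = {
        CU_LAUNCH_PARAM_BUFFER_POINTER, buf,
        CU_LAUNCH_PARAM_BUFFER_SIZE,    &size,
        CU_LAUNCH_PARAM_END
    };
    cuLaunchKernel(kern, 1, 1, 1, 256, 1, 1,
                   0, NULL, NULL, extra);
}
```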
From a strict performance point of view, I'm wondering whether one approach is better than the others. Does anyone know:
- Does option 1 result in N transfers to the GPU, whereas option 2 results in only one?
- If that is true for option 1, will CUDA re-send a parameter even when its value has not changed across several kernel calls?
The real issue behind these questions is that I have a rather "simple" kernel with a huge number of arguments, called many times with (almost) the same argument values, and I'm wondering whether merely passing the arguments can have a measurable impact on performance.
The answers here do not fully address my questions.
EDIT: Also, does anyone know whether nvprof measures kernel time only, or argument passing plus kernel time?