The very short answer is that all arguments to CUDA kernels are passed by value, and those arguments are copied by the host via an API call into a dedicated argument buffer in GPU memory. At present, that buffer is stored in constant memory, and there is a limit of 4KB of arguments per kernel launch -- see here.
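As a minimal sketch of what "passed by value" means in practice (names here are illustrative, not from any particular codebase): structs are copied wholesale into the argument buffer at launch, and pointer arguments are themselves copied by value, so modifying an argument inside the kernel never affects the host's copy.

```cuda
#include <cstdio>

struct Params {
    float scale;
    int   n;
};

// Every argument, including the struct, is passed by value: the driver
// copies the bytes into the kernel argument buffer at launch time.
__global__ void scale_kernel(float *data, Params p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        data[i] *= p.scale;   // data is a device pointer, itself passed by value
}

int main()
{
    const int n = 256;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    Params p = {2.0f, n};             // copied into the argument buffer
    scale_kernel<<<1, n>>>(d_data, p);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

The combined size of all arguments (`d_data` plus the `Params` struct here) is what counts against the per-launch argument size limit.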
In more detail, the PTX standard (technically since compute capability 2.0 hardware and the CUDA ABI appeared) defines a dedicated logical state space called .param which holds kernel and device function parameter arguments. See here. Quoting from that documentation:
Each kernel function definition includes an optional list of
parameters. These parameters are addressable, read-only variables
declared in the .param state space. Values passed from the host to the
kernel are accessed through these parameter variables using ld.param
instructions. The kernel parameter variables are shared across all
CTAs within a grid.
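You can see the ld.param access pattern described above by compiling a trivial kernel to PTX. The PTX shown in the comment below is abridged and will vary with toolkit version and target architecture, so treat it as a sketch rather than exact compiler output:

```cuda
// A trivial kernel whose parameters live in the .param state space:
__global__ void axpy(float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] += a * x[i];
}

/* Compiling with `nvcc -ptx` emits something like (abridged):

   .visible .entry _Z4axpyfPKfPf(
       .param .f32 _Z4axpyfPKfPf_param_0,
       .param .u64 _Z4axpyfPKfPf_param_1,
       .param .u64 _Z4axpyfPKfPf_param_2
   )
   {
       ...
       ld.param.f32  %f1,  [_Z4axpyfPKfPf_param_0];
       ld.param.u64  %rd1, [_Z4axpyfPKfPf_param_1];
       ld.param.u64  %rd2, [_Z4axpyfPKfPf_param_2];
       ...
   }

   Each kernel argument becomes a .param declaration, and the generated
   code reads it with an ld.param instruction, exactly as the quoted
   documentation describes. */
```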
It further notes that:
Note: The location of parameter space is implementation specific. For example, in some implementations kernel parameters reside in
global memory. No access protection is provided between parameter and
global space in this case. Similarly, function parameters are mapped
to parameter passing registers and/or stack locations based on the
function calling conventions of the Application Binary Interface
(ABI).
So the precise location of the parameter state space is implementation specific. On the first generation of CUDA hardware, it actually mapped to shared memory for kernel arguments and registers for device function arguments. However, since compute capability 2.0 hardware and the PTX 2.2 standard, it maps to constant memory for kernels under most circumstances. The documentation says the following on the matter:
The constant (.const) state space is a read-only memory initialized
by the host. Constant memory is accessed with a ld.const
instruction. Constant memory is restricted in size, currently limited
to 64 KB which can be used to hold statically-sized constant
variables. There is an additional 640 KB of constant memory,
organized as ten independent 64 KB regions. The driver may allocate
and initialize constant buffers in these regions and pass pointers to
the buffers as kernel function parameters. Since the ten regions are
not contiguous, the driver must ensure that constant buffers are
allocated so that each buffer fits entirely within a 64 KB region and
does not span a region boundary.
Statically-sized constant variables have an optional variable initializer; constant variables with no explicit initializer are
initialized to zero by default. Constant buffers allocated by the
driver are initialized by the host, and pointers to such buffers are
passed to the kernel as parameters.
[Emphasis mine]
So while kernel arguments are stored in constant memory, it is not the same constant memory that maps to the .const state space accessible by declaring a variable as __constant__ in CUDA C (or the equivalent in Fortran or Python). Rather, it is an internal pool of device memory managed by the driver, not directly accessible to the programmer.
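The distinction can be made concrete with a short sketch (illustrative names, assuming a default-device setup): a __constant__ variable is addressable by the programmer and counts against the 64 KB static limit, while a kernel argument also lands in constant memory but in a driver-managed bank with no user-visible symbol.

```cuda
#include <cstdio>

// Lives in the user-visible .const state space; counts against the
// 64 KB static constant memory limit and is addressable via its symbol.
__constant__ float coeffs[4];

// The argument `bias` also ends up in constant memory at launch, but in
// a driver-managed pool the programmer cannot name or address directly.
__global__ void apply(float *out, float bias, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[i % 4] + bias;
}

int main()
{
    const int n = 16;
    float h_coeffs[4] = {1.f, 2.f, 3.f, 4.f};
    // Initialising .const space requires an explicit symbol copy...
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    // ...whereas `bias` is copied implicitly via the argument buffer.
    apply<<<1, n>>>(d_out, 0.5f, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```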