The very short answer is that all arguments to CUDA kernels are passed by value, and those arguments are copied by the host via an API call into a dedicated argument buffer in GPU memory. At present, that buffer is stored in constant memory, and there is a limit of 4KB of arguments per kernel launch -- see here.
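As a minimal sketch of what "passed by value" means in practice (names here are illustrative, not from any particular codebase): structs are copied wholesale into the argument buffer at launch, and pointer arguments are themselves copied by value, so modifying an argument inside the kernel never affects the host's copy.

```cuda
#include <cstdio>

struct Params {
    float scale;
    int   n;
};

// Every argument, including the struct, is passed by value: the driver
// copies the bytes into the kernel argument buffer at launch time.
__global__ void scale_kernel(float *data, Params p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        data[i] *= p.scale;   // data is a device pointer, itself passed by value
}

int main()
{
    const int n = 256;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    Params p = {2.0f, n};             // copied into the argument buffer
    scale_kernel<<<1, n>>>(d_data, p);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

The combined size of all arguments (`d_data` plus the `Params` struct here) is what counts against the per-launch argument size limit.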
In more detail, the PTX standard (technically since compute capability 2.0 hardware and the CUDA ABI appeared) defines a dedicated logical state space called .param which holds kernel and device function parameter arguments. See here. Quoting from that documentation:
Each kernel function definition includes an optional list of
parameters. These parameters are addressable, read-only variables
declared in the .param state space. Values passed from the host to the
kernel are accessed through these parameter variables using ld.param
instructions. The kernel parameter variables are shared across all
CTAs within a grid.
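You can see the ld.param access pattern described above by compiling a trivial kernel to PTX. The PTX shown in the comment below is abridged and will vary with toolkit version and target architecture, so treat it as a sketch rather than exact compiler output:

```cuda
// A trivial kernel whose parameters live in the .param state space:
__global__ void axpy(float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    y[i] += a * x[i];
}

/* Compiling with `nvcc -ptx` emits something like (abridged):

   .visible .entry _Z4axpyfPKfPf(
       .param .f32 _Z4axpyfPKfPf_param_0,
       .param .u64 _Z4axpyfPKfPf_param_1,
       .param .u64 _Z4axpyfPKfPf_param_2
   )
   {
       ...
       ld.param.f32  %f1,  [_Z4axpyfPKfPf_param_0];
       ld.param.u64  %rd1, [_Z4axpyfPKfPf_param_1];
       ld.param.u64  %rd2, [_Z4axpyfPKfPf_param_2];
       ...
   }

   Each kernel argument becomes a .param declaration, and the generated
   code reads it with an ld.param instruction, exactly as the quoted
   documentation describes. */
```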
It further notes that:
Note: The location of parameter space is implementation specific. For example, in some implementations kernel parameters reside in
global memory. No access protection is provided between parameter and
global space in this case. Similarly, function parameters are mapped
to parameter passing registers and/or stack locations based on the
function calling conventions of the Application Binary Interface
(ABI).
So the precise location of the parameter state space is implementation specific. On the first generation of CUDA hardware, it actually mapped to shared memory for kernel arguments and registers for device function arguments. However, since compute capability 2.0 hardware and the PTX 2.2 standard, it maps to constant memory for kernels under most circumstances. The documentation says the following on the matter:
The constant (.const) state space is a read-only memory initialized
by the host. Constant memory is accessed with a ld.const
instruction. Constant memory is restricted in size, currently limited
to 64 KB which can be used to hold statically-sized constant
variables. There is an additional 640 KB of constant memory,
organized as ten independent 64 KB regions. The driver may allocate
and initialize constant buffers in these regions and pass pointers to
the buffers as kernel function parameters. Since the ten regions are
not contiguous, the driver must ensure that constant buffers are
allocated so that each buffer fits entirely within a 64 KB region and
does not span a region boundary.
Statically-sized constant variables have an optional variable initializer; constant variables with no explicit initializer are
initialized to zero by default. Constant buffers allocated by the
driver are initialized by the host, and pointers to such buffers are
passed to the kernel as parameters.
[Emphasis mine]
So while kernel arguments are stored in constant memory, it is not the same constant memory that maps to the .const state space accessible by declaring a variable as __constant__ in CUDA C (or the equivalent in Fortran or Python). Rather, it is an internal pool of device memory managed by the driver, not directly accessible to the programmer.
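The distinction can be made concrete with a short sketch (illustrative names, assuming a default-device setup): a __constant__ variable is addressable by the programmer and counts against the 64 KB static limit, while a kernel argument also lands in constant memory but in a driver-managed bank with no user-visible symbol.

```cuda
#include <cstdio>

// Lives in the user-visible .const state space; counts against the
// 64 KB static constant memory limit and is addressable via its symbol.
__constant__ float coeffs[4];

// The argument `bias` also ends up in constant memory at launch, but in
// a driver-managed pool the programmer cannot name or address directly.
__global__ void apply(float *out, float bias, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[i % 4] + bias;
}

int main()
{
    const int n = 16;
    float h_coeffs[4] = {1.f, 2.f, 3.f, 4.f};
    // Initialising .const space requires an explicit symbol copy...
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    // ...whereas `bias` is copied implicitly via the argument buffer.
    apply<<<1, n>>>(d_out, 0.5f, n);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```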