
I have a buffer (array) on the host that should reside in the constant memory region of the device (in this case, an NVIDIA GPU).

So, I have two questions:

  1. How can I allocate a chunk of constant memory? I am tracking the available constant memory on the device, and I know for a fact that we have that amount available to us (at this time).

  2. How can I initialize (populate) those arrays with values that are computed at run time on the host?

I searched the web, but there is no concise document covering this. I would appreciate it if the examples provided were in both OpenCL and CUDA. The OpenCL example is more important to me than the CUDA one.

mgNobody
  • For CUDA: https://stackoverflow.com/questions/28987495/how-to-use-constant-memory-for-beginners-cuda-c/28993944#28993944 – Sebastian May 06 '22 at 22:11
  • CUDA can handle 704 KB of constant memory per kernel, with contiguous arrays up to 64 KB, if I read that section correctly: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#constant-state-space – Sebastian May 07 '22 at 05:23
  • @Sebastian: Your reading isn't strictly correct. There is only 64kb of programmer-available constant memory per kernel. The other banks are reserved by the driver for internal static allocations like kernel arguments. – talonmies May 07 '22 at 05:32
  • Additional info: https://stackoverflow.com/questions/12290708/cuda-constant-memory-banks https://stackoverflow.com/questions/10256402/why-is-the-constant-memory-size-limited-in-cuda https://stackoverflow.com/questions/66910701/is-there-a-way-to-access-value-of-constant-memory-bank-in-cuda https://stackoverflow.com/questions/45626354/what-are-the-lifetimes-for-cuda-constant-memory – Sebastian May 07 '22 at 05:38
  • @talonmies By default, only bank 2 is used for user-defined constant memory in C. Assembler (ISA) can access the full set of banks (never tried it myself). See https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#banked-constant-state-space-deprecated for how it was done up to version 2.1, and https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-function-parameter-attributes for how it is done since version 2.2. You can pass several pointers to different const spaces to the kernels. – Sebastian May 07 '22 at 05:50

3 Answers

  1. How can I allocate a chunk of constant memory? I am tracking the available constant memory on the device, and I know for a fact that we have that amount available to us (at this time).

In CUDA, you can't. There is no runtime allocation of constant memory, only static definition of memory via the __constant__ specifier, which gets mapped to constant memory pages at assembly time. You could generate code containing such a static declaration at runtime and compile it via nvrtc, but that seems like a lot of effort for something you know can only be sized up to 64kb. It seems much simpler (to me at least) to just statically declare a 64kb constant buffer and use it at runtime as you see fit.

  2. How can I initialize (populate) those arrays with values that are computed at run time on the host?

As noted in the comments, see here. The cudaMemcpyToSymbol API was created for this purpose, and it works just like standard memcpy.
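For illustration, a minimal sketch of that pattern (the buffer name, size, and kernel here are made up for the example, not taken from the question):

#include <cuda_runtime.h>

// Statically declared constant buffer, sized at compile time (within the 64kb limit).
__constant__ float coeffs[256];

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= coeffs[i % 256];
}

int main()
{
    // Values computed on the host at run time...
    float host_coeffs[256];
    for (int i = 0; i < 256; ++i)
        host_coeffs[i] = 1.0f / (1.0f + i);

    // ...then copied into the __constant__ symbol, analogous to a memcpy.
    cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs));

    float *data;
    cudaMalloc(&data, 1024 * sizeof(float));
    scale<<<4, 256>>>(data, 1024);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}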

Functionally, there is no difference between __constant in OpenCL and __constant__ in CUDA. The same limitations apply: static definition at compile time (which is runtime in the standard OpenCL execution model) and a 64kb limit.

talonmies

Disclaimer: I cannot help you with CUDA.

For OpenCL, constant memory is either defined inline in the kernel source or treated, from the programmer/API point of view, as read-only global memory. That gives you two options:

  1. Define constant variables, arrays, etc. in your kernel code, like constant float DCT_C4 = 0.707106781f;. Note that you can dynamically generate the kernel code on the host at runtime to embed derived constant data if you wish.
  2. Pass constant memory from host to kernel via a buffer object, just as you would for global memory. Simply specify a pointer parameter in the constant memory region in your kernel function's prototype and set the buffer on the host side with clSetKernelArg(), for example:
// Kernel code:
kernel void mykernel(
    constant float* fixed_parameters,
    global const uint* dynamic_input_data,
    global uint* restrict output_data)
{
    /* ... */
}

// Host code (context is the cl_context in use):
cl_mem fixed_parameter_buffer = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS | CL_MEM_COPY_HOST_PTR,
    sizeof(cl_float) * num_fixed_parameters, fixed_parameter_data,
    NULL);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), &fixed_parameter_buffer);

Make sure to take into account the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE for the context being used!

It usually doesn't help to use constant memory buffers for streaming input data; that is better stored in global buffers, even if they are marked read-only for the kernel. constant memory is most useful for data that are used by a large proportion of work-items. There is typically a fairly tight size limitation on it, such as 64KiB - some implementations may "spill" to global memory if you try to exceed it, which will lose you any performance advantage you would have gained from using constant memory.
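As a minimal sketch of that query (assuming `device` is a `cl_device_id` already obtained via `clGetDeviceIDs()`; the function name is just for the example):

#include <CL/cl.h>

// Returns the device's constant buffer limit in bytes.
cl_ulong query_constant_limit(cl_device_id device)
{
    cl_ulong max_const_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(max_const_size), &max_const_size, NULL);
    return max_const_size;
}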

pmdj
  • Thanks for your answer. Does the `clCreateBuffer` API try to return a pointer to `constant float* fixed_parameters`? If so, how does it know which symbol we are looking for? – mgNobody May 09 '22 at 06:30
  • `clCreateBuffer` returns a handle to the buffer object (`cl_mem`) which must be released when you’re finished with it. In the example, I’m copying the host array `fixed_parameter_data` into the buffer. There are many other ways of filling OpenCL buffers however - this is exactly the same as for `global` memory though. – pmdj May 09 '22 at 06:45
  • There is no symbol resolution between host and kernel code in OpenCL, other than referring to kernel functions by name in the host API. The second argument to `clSetKernelArg` is the kernel argument index, here 0 because I happened to list the constant buffer as the first parameter to the kernel function. – pmdj May 09 '22 at 06:48
  • I see. So, essentially, you are advocating for creating a READ_ONLY buffer in global memory. I was looking for a way to deal with `__constant` memory on the device (utilizing the texture memory). – mgNobody May 09 '22 at 17:48
  • No, the buffer contents will be copied to `constant` memory (direct-addressed cache/registers/whatever the hardware uses) when you queue the kernel. What triggers it is declaring the argument `constant` in the kernel function signature instead of `global`. Host side, you manage this data identically to global memory. – pmdj May 09 '22 at 17:53

For CUDA, I use the driver API and NVRTC, and I create the kernel string with a global constant array like this:

auto kernel = R"(
..
__constant__ @@Type@@ buffer[@@SIZE@@]={
   @@elm@@
};
..
__global__ void test(int * input)
{   }

)";   

then replace the @@-pattern placeholders with the size and element values at run time, so that the compiled source contains something like this:

__constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };
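A rough sketch of how that replace-and-compile step could look with NVRTC (the helper and all names here are illustrative assumptions, not code from this answer):

#include <nvrtc.h>
#include <string>
#include <sstream>
#include <vector>

// Illustrative helper: replace every occurrence of `token` in `src` with `value`.
static void replaceAll(std::string &src, const std::string &token,
                       const std::string &value)
{
    for (size_t pos = 0; (pos = src.find(token, pos)) != std::string::npos;
         pos += value.size())
        src.replace(pos, token.size(), value);
}

// Splice run-time values into the kernel string and compile it to PTX.
std::string buildAndCompile(std::string source, const std::vector<int> &values)
{
    std::ostringstream elems;
    for (size_t i = 0; i < values.size(); ++i)
        elems << values[i] << (i + 1 < values.size() ? "," : "");
    replaceAll(source, "@@Type@@", "int");
    replaceAll(source, "@@SIZE@@", std::to_string(values.size()));
    replaceAll(source, "@@elm@@", elems.str());

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source.c_str(), "test.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);

    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);

    // The PTX can then be loaded with cuModuleLoadData() and launched
    // through the driver API.
    return ptx;
}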

So, it is run time for the host but compile time for the device. The downside is that the kernel string gets very big, is less readable, and connecting other compilation units requires explicit linking (as if you were compiling a separate C++ project). But for simple calculations that use only your own implementations (no host definitions used directly), it is the same as the runtime API.

Since large strings require extra parsing time, you can cache the PTX intermediate data and also cache the binary generated from the PTX. Then you can check whether the kernel string has changed and needs to be re-compiled.

Are you sure __constant__ is worth the effort? Do you have benchmark results showing that it actually improves performance? (Premature optimization is the source of all evil.) Perhaps your algorithm works with register tiling, and the source of the data does not matter?

huseyin tugrul buyukisik
  • Thanks @huseyin. It makes sense. I guess I was looking for a more formal way (like enqueueMemCopy) to deal with this. – mgNobody May 09 '22 at 16:34
  • When kernels are loaded/switched, their environment switches too, so you don't have to re-initialize `__constant__` arrays again. But the kernel binary grows, and this increases compile time. At the very least, a lot of string data has to be parsed by the driver. – huseyin tugrul buyukisik May 09 '22 at 16:45
  • Also, using host class definitions requires linking through the driver API. It is like writing a DLL for a C++ project and using the host as a DLL, from the CUDA kernel's point of view. – huseyin tugrul buyukisik May 09 '22 at 16:48
  • I think talonmies has given the best answer here. – huseyin tugrul buyukisik May 09 '22 at 16:57
  • Thanks for your help. Actually, your approach is a good one for OpenCL. Compiling and building a CL source file in OpenCL is a very common thing (as you know). The only downside is that we have to do some text manipulation and compile/build kernels from scratch. – mgNobody May 09 '22 at 17:45