
I'm having trouble wrapping my head around the restrictions on CUDA constant memory.

  1. Why can't we allocate __constant__ memory at runtime? Why do I need to compile in a fixed-size variable with near-global scope?

  2. When is constant memory actually loaded, or unloaded? I understand that cudaMemcpyToSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory? Relatedly, is there a cost to binding and unbinding similar to the old cost of binding textures (i.e., using textures added a cost to every kernel invocation)?

  3. Where does constant memory reside on the chip?


I'm primarily interested in answers as they relate to Pascal and Volta.

Mikhail
  • No, each kernel does not get or use its own allocation of `__constant__` memory. There are no binding/unbinding costs. `__constant__` memory is a **logical** space. The physical backing consists of a particular region of DRAM plus the constant caches. The constant caches are a per-SM resource. The DRAM is an off-chip resource. The maximum size of the logical space is 64KB. That does not mean that each per-SM cache is 64KB. The actual size of the constant caches is not well specified and may change from one GPU arch to another, but is on the order of 8KB per SM. – Robert Crovella Aug 11 '17 at 03:24
  • @RobertCrovella So, when are the constant caches unloaded? Is it loaded/unloaded for each invocation? Among other things, I'm worried about a few huge `__constant__` objects in unrelated parts of my code, potentially exceeding the const cache size. – Mikhail Aug 11 '17 at 03:29
  • @RobertCrovella Also, I understand it's a logical space, but I'm not sure what else lives in the physical mapping to that space (trying to wrap my mind around some of the limits of the constant cache). – Mikhail Aug 11 '17 at 03:31
  • Constant caches, like most other caches that I am familiar with, are loaded as you make accesses to (`__constant__`) memory. Isn't that how most caches work? There is no explicit "loading" or "unloading" of a cache that I am familiar with. At a basic level, a cache is a transparent entity from the programmer's perspective. If you are asking are there any global invalidations of the constant caches, I'm pretty sure that information is not specified anywhere. One of the ramifications of "not specified" is that it could change at any time, therefore an "answer" given today may be wrong tomorrow. – Robert Crovella Aug 11 '17 at 03:36
  • @RobertCrovella Well, it's explicitly loaded through `cudaMemcpyToSymbol`, so it doesn't feel like a traditional cache. Certainly not like the x86 cache where stuff is magically "cached". Feels more like shared memory but with more restrictions. – Mikhail Aug 11 '17 at 03:37
  • No, `cudaMemcpyToSymbol` loads the underlying physical backing, i.e. DRAM. And it will flow through the L2, (didn't mention that above), but **by definition** it does not flow through the constant caches (they are read-only in character, and do not permit or track writes). `__constant__` **memory** is initialized that way. The constant caches behave the way most other caches behave (as you make accesses to the underlying data). An x86 cache only caches the things that are in DRAM, and those things are in DRAM because you, as the programmer, explicitly put them there, through some mechanism. – Robert Crovella Aug 11 '17 at 03:39

1 Answer


It is probably easiest to answer these six questions in reverse order:

Where does constant memory reside on the chip?

It doesn't. Constant memory is stored in statically reserved physical memory off-chip and accessed via a per-SM cache. When the compiler can identify that a variable is stored in the logical constant memory space, it will emit specific PTX instructions which allow access to that static memory via the constant cache. Note also that there are specific reserved constant memory banks for storing kernel arguments on all currently supported architectures.
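As a minimal sketch of that pattern (the array name `coeffs` and the kernel are hypothetical, not from the question): a file-scope `__constant__` declaration whose device-side reads the compiler can statically place in the constant address space:

```cuda
// File-scope declaration in the logical __constant__ space. The compiler
// knows statically that reads of coeffs[] are constant-space loads, so it
// can emit ld.const PTX instructions for them.
__constant__ float coeffs[16];

__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Every thread in the warp reads the same address, which is the
        // access pattern the constant cache serves best. The data itself
        // lives in off-chip DRAM and is cached per SM on first use.
        out[i] = in[i] * coeffs[0];
    }
}
```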

Is there a cost to binding and unbinding similar to the old cost of binding textures (i.e., using textures added a cost to every kernel invocation)?

No. But there also isn't any "binding" or "unbinding", because reservations are performed statically. The only runtime costs are host-to-device memory transfers and the cost of loading the symbols into the context as part of context establishment.

I understand that cudaMemcpyToSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory?

No. There is only one "allocation" for the entire GPU (although, as noted above, there are specific constant memory banks for kernel arguments, so in some sense you could say that there is a per-kernel component of constant memory).
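To illustrate with a hypothetical pair of kernels: everything compiled in the same unit sees the same `__constant__` storage, with no per-kernel copy:

```cuda
__constant__ float params[8];   // one allocation, shared by every kernel below

__global__ void kernelA(float *out) { out[0] = params[0]; }
__global__ void kernelB(float *out) { out[0] = params[1]; }

// A value uploaded with cudaMemcpyToSymbol(params, ...) before launching
// kernelA is still visible to kernelB; both read the same backing store.
```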

When is constant memory actually loaded, or unloaded?

It depends what you mean by "loaded" and "unloaded". Loading is really a two-phase process: first, the symbol is retrieved and loaded into the context (if you use the runtime API this is done automagically), and second, any user runtime operations alter the contents of the constant memory via cudaMemcpyToSymbol.
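As a rough illustration of the second phase (reusing the hypothetical `coeffs` symbol from the sketch above; the `upload` helper is mine), an explicit runtime update looks like this:

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[16];

// Phase one, registering the symbol with the context, happens implicitly
// when the runtime API initializes. Phase two is this explicit copy into
// the DRAM backing of the symbol.
void upload(const float *host_vals)
{
    cudaMemcpyToSymbol(coeffs, host_vals, 16 * sizeof(float));
}
```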

Why do I need to compile in a fixed size variable with near-global scope?

As already noted, constant memory is basically a logical address space in the PTX memory hierarchy. It is reflected by a finite-size reserved area of the GPU DRAM map, and it requires the compiler to emit specific instructions so that access goes uniformly through a dedicated on-chip cache or caches. Given its static, compiler-analysis-driven nature, it makes sense that its implementation in the language would also be primarily static.
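Concretely, that static requirement means the declaration must appear at file scope with a compile-time size; a minimal contrast (hypothetical names):

```cuda
// Legal: fixed size known at compile time, declared at file scope.
__constant__ float table[1024];

// Not legal: __constant__ variables cannot be defined inside a function,
// and their size cannot depend on a runtime value.
// void f(int n) {
//     __constant__ float local_table[n];   // compile error
// }
```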

Why can't we allocate __constant__ memory at runtime?

Primarily because NVIDIA have chosen not to expose it, and given all the constraints outlined above, I don't think that is an outrageously poor choice. Some of this might well be historic: constant memory has been part of the CUDA design since the beginning, and almost all of the original features and functionality in CUDA map to hardware features which existed for the hardware's first purpose, namely the graphics APIs the GPUs were designed to support. So some of what you are asking about might well be tied to historical features or limitations of either OpenGL or Direct3D, but I am not familiar enough with either to say for sure.

talonmies