To avoid really long and incoherent functions, I am calling a number of __device__ functions from a kernel. I allocate a shared memory buffer at the beginning of the kernel (it is allocated per thread block) and pass pointers to it to all the __device__ functions that perform the individual processing steps of the kernel.
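For concreteness, the structure looks roughly like this (the names are placeholders, and I assume the kernel is launched with 256 threads per block on a data size that is a multiple of 256):

__device__ void processStep(float *buf)
{
    // Just a plain pointer: nothing in the signature says whether
    // buf points into shared or global memory.
    buf[threadIdx.x] *= 2.0f;
}

__global__ void myKernel(const float *in, float *out)
{
    // Scratch buffer, allocated once per thread block.
    __shared__ float sharedBuf[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sharedBuf[threadIdx.x] = in[i];
    __syncthreads();

    processStep(sharedBuf);   // several such steps in the real code
    __syncthreads();

    out[i] = sharedBuf[threadIdx.x];
}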
I was wondering about the following: if I allocate a shared memory buffer in a __global__ function, how can the __device__ functions to which I pass a pointer distinguish between the possible address spaces (global device memory or shared memory) that the pointer could refer to?
Note that it is invalid to decorate the formal parameters with a __shared__ modifier, according to the CUDA programming guide. The only ways it could be implemented, IMHO, are:

a) putting markers on the allocated memory,

b) passing invisible parameters with the call, or

c) having a virtual unified address space with separate segments for global and shared memory, so that a threshold check on the pointer can decide (see the sketch below).
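If (c) is what actually happens, I would expect something like the following to work. This is just a sketch: I have not checked whether the address-space predicate intrinsics documented in newer toolkits (__isShared(), __isGlobal()) are actually available with my 3.0 setup.

__device__ void processStep(float *buf)
{
    // Hypothetical: ask which address space the generic pointer
    // falls into; the intrinsics return non-zero if the pointer
    // refers to shared / global memory, respectively.
    if (__isShared(buf)) {
        // take a shared-memory-specific path
    } else if (__isGlobal(buf)) {
        // take a global-memory-specific path
    }
    buf[threadIdx.x] *= 2.0f;
}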
So my question is: do I need to worry about this at all, or how should one proceed otherwise without inlining all functions into the main kernel?
As an aside, I was horrified to find today that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. In effect this means I have to declare all __device__ functions inline, and the separation of header / source files is broken; I end up with something like the sketch below. This is of course quite ugly, but is there an alternative?
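For illustration, the workaround looks roughly like this (file names made up):

// processing_steps.cuh -- the definition has to live in a header
// (and be marked inline) rather than in its own .cu file, because
// the kernel's translation unit must see the whole body.
inline __device__ void processStep(float *buf)
{
    buf[threadIdx.x] *= 2.0f;
}

// kernels.cu
#include "processing_steps.cuh"   // pulled in wholesale, so no real separation of interface and implementation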