Can my kernel code tell how much shared memory it has available?

Question

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?

On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in the upper limit to that size, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).

einpoklum · Accepted Answer · 2017-02-18T08:56:21.453

TL;DR: Yes. Use the function below.

It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.

Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.

But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:

__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret; 
    asm volatile ("mov.u32 %0, %dynamic_smem_size;" : "=r"(ret));
    return ret;
}

and a similar function for %total_smem_size. This function makes the compiler add an explicit PTX instruction, just like asm can be used for host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign

x = dynamic_smem_size();

you actually just assign the value of the special register to x.

Can my kernel code tell how much shared memory it has available?

1 Answers1

TL;DR: Yes. Use the function below.