TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size
and %total_smem_size
.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x
, we are actually accessing the special register %tid.x
, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %
-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
unsigned ret;
asm volatile ("mov.u32 %0, %dynamic_smem_size;" : "=r"(ret));
return ret;
}
and a similar function for %total_smem_size
. This function makes the compiler add an explicit PTX instruction, just like asm
can be used for host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x
.