
Suppose we have a kernel that invokes some functions, for instance:

__device__ int fib(int n) {
    if (n == 0 || n == 1) {
        return n;
    } else {
        int x = fib(n - 1);
        int y = fib(n - 2);
        return x + y;
    }
}

__global__ void fib_kernel(int* n, int* ret) {
    *ret = fib(*n);
}

The kernel fib_kernel invokes the function fib(), which in turn invokes itself twice. Suppose the GPU has 80 SMs, we launch exactly 80 threads to do the computation, and we pass in n as 10. I am aware that there will be a ton of duplicated computation, which violates the idea of data parallelism, but I would like to better understand the stack management of each thread.
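
For concreteness, here is one way the launch described above could look on the host (a sketch only; error checking is omitted, and all 80 threads redundantly write the same value to *ret):

int main() {
    int h_n = 10, h_ret = 0;
    int *d_n, *d_ret;
    cudaMalloc(&d_n, sizeof(int));
    cudaMalloc(&d_ret, sizeof(int));
    cudaMemcpy(d_n, &h_n, sizeof(int), cudaMemcpyHostToDevice);

    // 80 blocks of 1 thread each, one per SM.
    fib_kernel<<<80, 1>>>(d_n, d_ret);

    cudaMemcpy(&h_ret, d_ret, sizeof(int), cudaMemcpyDeviceToHost);
    // h_ret == fib(10) == 55
    cudaFree(d_n);
    cudaFree(d_ret);
    return 0;
}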

The CUDA PTX documentation states the following:

the GPU maintains execution state per thread, including a program counter and call stack

  1. The stack is located in local memory. As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack grows and shrinks dynamically?

  2. The stack of each thread is private and not accessible by other threads. Is there a way that I can manually instrument the compiler/driver so that the stack is allocated in global memory rather than in local memory?

  3. Is there a way for a thread to obtain its current program counter and frame pointer values? I think they are stored in specific registers, but the PTX documentation does not provide a way to access them. What would I have to modify (e.g. the driver or the compiler) to be able to read those registers?

  4. If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with that? The answer to question 2 might address this; any other thoughts would be appreciated.

Ethan L.
  • Local memory is physically located in global memory – Abator Abetor Nov 28 '22 at 08:40
  • You could just compute fib iteratively, without worrying about stack size (see the sketch after these comments) – Abator Abetor Nov 28 '22 at 08:41
  • Yeah, but local memory is not accessible by other threads. Is there a way to copy a stack frame out of local memory and place it in global memory? Or could we modify the driver/compiler (I'm not sure what would have to be modified) so that the stack frame for each thread is allocated in global memory and is accessible by all other threads? This might lead to security issues, but it is not that dangerous in terms of doing computation. – Ethan L. Nov 28 '22 at 08:45
  • With a debugger you can read the local memory of each thread, while single-stepping through the CUDA program. – Sebastian Nov 28 '22 at 22:46
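
A minimal sketch of the iterative approach suggested in the comments (fib_iter and fib_kernel_iter are hypothetical names; the kernel mirrors the one in the question). Because it uses a fixed number of locals and no recursion, its stack usage is constant regardless of n:

__device__ int fib_iter(int n) {
    int a = 0, b = 1;  // fib(0), fib(1)
    for (int i = 0; i < n; ++i) {
        int next = a + b;
        a = b;
        b = next;
    }
    return a;
}

__global__ void fib_kernel_iter(int* n, int* ret) {
    *ret = fib_iter(*n);
}

(Note that for n anywhere near 10000 the result overflows int in any case; the point here is only the stack behavior.)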

1 Answer


You'll get a somewhat better idea of how these things work if you study the generated SASS code from a few examples.
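
For example (a sketch, assuming the code lives in fib.cu; sm_70 is just an illustrative architecture, substitute your device's):

nvcc -arch=sm_70 -cubin fib.cu -o fib.cubin
cuobjdump -sass fib.cubin

Adding -Xptxas -v to an nvcc compile line also makes ptxas report each function's per-thread stack frame size.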

As the threads execute the kernel, do they behave just like the calling convention on a CPU? In other words, is it true that for each thread, the corresponding stack grows and shrinks dynamically?

The CUDA compiler will aggressively inline functions when it can. When it can't (for example, with the recursion here), it builds a stack-like structure in local memory. However, the GPU instructions I'm aware of don't include explicit stack management (e.g. push and pop), so the "stack" is "built by the compiler" using registers that hold a (local) address and LD/ST instructions to move data to/from the "stack" space. In that sense, the actual stack can change size dynamically, but the maximum allowable stack space is limited. Each thread has its own stack, using the definition of "stack" given here.
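
You can observe the inlining behavior with a small experiment (a sketch; add_one and demo_kernel are made-up names): mark a trivial function __noinline__ and compare the SASS with and without the qualifier. The plain version is typically inlined away entirely, while the __noinline__ version produces an actual call/return sequence:

// Without __noinline__, the compiler would normally inline this call
// and no call sequence would be visible in the SASS.
__device__ __noinline__ int add_one(int x) {
    return x + 1;
}

__global__ void demo_kernel(int* v) {
    *v = add_one(*v);
}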

Is there a way that I can manually instrument the compiler/driver so that the stack is allocated in global memory rather than in local memory?

Practically, no. The NVIDIA compiler has a front end and a back end; the back end that generates machine instructions is closed source. If you want to modify an open-source compiler for the GPUs it might be possible, but at the moment there are no widely recognized tool chains that I am aware of that don't use the closed-source back end (ptxas or its driver equivalent). The GPU driver is also largely closed source. There aren't any exposed controls that would affect the location of the stack, either.

What would I have to modify (e.g. the driver or the compiler) to be able to read those registers?

There is no published register for the instruction pointer/program counter. Therefore it's impossible to state what modifications would be needed.

If we increase the input to fib(n) to 10000, it is likely to cause a stack overflow. Is there a way to deal with that?

As I mentioned, the maximum stack space per thread is limited, so your observation is correct: eventually a stack could grow to exceed the available space (and this is a possible hazard for recursion in CUDA device code). The provided mechanism to address this is to increase the per-thread stack size limit (since the stack exists in the logical local space), as sketched below.
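
A minimal host-side sketch of that mechanism, using the runtime API's cudaDeviceSetLimit with cudaLimitStackSize (the 64 KB figure is only illustrative, not a recommendation):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stack_size = 0;
    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    printf("default per-thread stack size: %zu bytes\n", stack_size);

    // Raise the per-thread stack limit before launching the kernel.
    cudaDeviceSetLimit(cudaLimitStackSize, 64 * 1024);

    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    printf("new per-thread stack size: %zu bytes\n", stack_size);
    return 0;
}

Even so, deep recursion like fib(10000) can exhaust any practical limit, so the iterative formulation suggested in the comments is the more robust fix.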

Robert Crovella
  • Thanks for your detailed reply. Suppose we found a way to convert local memory instructions into global memory instructions in SASS, e.g. LDL -> LDG; could we simply replace those local memory instructions with global memory instructions? Also, are the memory addresses in SASS code physical addresses? I think they are likely actual physical addresses, since that would not require address translation; but since local memory and global memory both reside in DRAM, is the memory layout something like "local mem": [0x0, 0x2FFFF..), "global mem": [0x2FFFF.., 0xFFFFF)? – Ethan L. Nov 29 '22 at 02:56