I've noticed that compilers sometimes keep garbage data on the call stack. The call stack consists of stack frames, each of which is the activation record of one function call. Ideally, a function's stack frame should contain only necessary data: spilled callee-saved registers, local variables that must be preserved across nested function calls, the return address, and so on.
Consider a situation where a function foo() calls several other functions. Across these nested calls, the amount of data foo() actually needs in its activation record may change. Below is an example:
    extern long f(long x);
    extern void bar(long x);
    extern void tail(void);

    void foo(long x) {
        long fx = f(x);  // x must be preserved across the f(x) call
                         // because x is later used again.
        bar(x + fx);     // No need to preserve anything. x and fx will
                         // no longer be used again.
        tail();          // Just to prevent tail call optimization on bar(...).
    }
However, the code compiled by Clang (version 14.0.4) does not minimize stack frame usage, as shown below. GCC (version 9.4.0) produces similar code. Both were run with -O2.
    foo:
        push %rbx          // preserve %rbx
        mov  %rdi,%rbx     // %rbx <- %rdi (%rbx preserves argument x)
        call f             // %rax <- f(%rdi)
        add  %rax,%rbx     // %rbx <- %rax + %rbx
        mov  %rbx,%rdi     // %rdi <- %rbx (from now on, %rbx is garbage
                           //   because x will never be used again)
        call bar           // bar(%rdi)
        pop  %rbx          // restore %rbx (this should occur earlier)
        jmp  tail          // tail()
Ideally, once the argument x of foo() is no longer needed, it should be discarded as soon as possible so that the stack frame's memory footprint stays as small as possible:
    foo:
        push %rbx          // preserve %rbx
        mov  %rdi,%rbx     // %rbx <- %rdi
        call f             // %rax <- f(%rdi)
        add  %rax,%rbx     // %rbx <- %rax + %rbx
        mov  %rbx,%rdi     // %rdi <- %rbx
        pop  %rbx          // restore %rbx (pop 8 bytes off the stack
                           //   before calling bar!)
        call bar           // bar(%rdi)
        jmp  tail          // tail()
So here is my question: are there any compiler options that keep the stack frame as compact as possible?
In the case shown above, the compiler definitely misses the optimization opportunity. In general, however, keeping the stack frame as compact as possible may introduce extra instructions to manipulate the stack pointer or even data copying inside the stack frame, which poses a trade-off between the call stack memory footprint and the runtime performance.
A smaller call stack memory footprint is valuable on embedded systems, where RAM is quite limited. On a PC, a smaller memory footprint can lead to better cache locality and thus potentially faster execution.
I'm aware of the -fstack-reuse option in GCC. Its default value is all, so changing it to any other value will only make the stack memory footprint worse.
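For reference, my understanding from the GCC documentation is that -fstack-reuse controls stack slot reuse for local variables and compiler-generated temporaries, which is a different thing from releasing a spilled callee-saved register early. Here is a minimal sketch of the kind of code it applies to. The names scoped_buffers and sum_bytes are made up for illustration, and whether the two arrays really end up sharing a slot is up to the compiler:

```c
#include <string.h>

/* Hypothetical helper, not part of the examples in this question. */
static long sum_bytes(const unsigned char *p, unsigned long n) {
    long s = 0;
    for (unsigned long i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* a and b live in disjoint scopes, so with -fstack-reuse=all the
 * compiler is allowed (but not required) to give them the same slot. */
long scoped_buffers(void) {
    long total = 0;
    {
        unsigned char a[256];
        memset(a, 1, sizeof a);
        total += sum_bytes(a, sizeof a);
    }
    {
        unsigned char b[256];
        memset(b, 2, sizeof b);
        total += sum_bytes(b, sizeof b);
    }
    return total;
}
```

Neither foo() example in this question has locals in disjoint scopes like this, so it is perhaps not surprising that the option makes no difference for them.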
Update 1:
Jonathan raised a concern about x being an argument, whose allocation is managed by the caller of foo(). If x were instead passed on the stack, things might be different.
So here is a better example, which needs to preserve an intermediate value across nested function calls:
    extern long f(long x);
    extern void bar(long x);
    extern void tail(void);

    void foo(long x) {
        long fx = f(x);
        bar(fx);            // fx must be preserved across this call
                            // because it will be used again later
        long ffx = f(fx);   // fx used again here
                            // no need to preserve anything from now on
                            // ideally the stack frame should be
                            // set to 0 before calling f()
        bar(ffx);
        tail();
    }
And here is the assembly generated by Clang (GCC's output is similar):
    foo:
        push %rbx          // preserve %rbx
        call f             // %rax <- f(%rdi)
        mov  %rax,%rbx     // %rbx <- %rax (fx is preserved in %rbx)
        mov  %rax,%rdi     // %rdi <- %rax
        call bar           // bar(%rdi)
        mov  %rbx,%rdi     // %rdi <- %rbx (use fx again here)
                           // (ideally should pop here)
        call f             // %rax <- f(%rdi)        ^
        mov  %rax,%rdi     // %rdi <- %rax           |
        call bar           // bar(%rdi)              |
        pop  %rbx          // restore %rbx ----------+
        jmp  tail
Update 2:
Unfortunately, -fconserve-stack, -fno-defer-pop and -foptimize-sibling-calls don't help with the examples above.
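One way to compare options like these without reading disassembly is to measure the stack depth at run time. Below is a rough sketch. It assumes GCC or Clang (it relies on the __builtin_frame_address builtin and the noinline attribute) and a stack that grows downward, as on x86-64. The stub bodies of f, bar and tail and the names note_frame and foo_stack_depth are made up for illustration; foo is the first example from this question.

```c
#include <stdint.h>

#define NOINLINE __attribute__((noinline))

/* Lowest frame address observed so far (hypothetical bookkeeping). */
static uintptr_t lowest_frame;

static void note_frame(void *frame) {
    uintptr_t a = (uintptr_t)frame;
    if (a < lowest_frame)
        lowest_frame = a;
}

/* Stub callees: each records where its own frame sits. */
NOINLINE long f(long x)   { note_frame(__builtin_frame_address(0)); return x + 1; }
NOINLINE void bar(long x) { (void)x; note_frame(__builtin_frame_address(0)); }
NOINLINE void tail(void)  { note_frame(__builtin_frame_address(0)); }

/* The function under test (first example from the question). */
NOINLINE void foo(long x) {
    long fx = f(x);
    bar(x + fx);
    tail();
}

/* Returns how many bytes below this function's frame the deepest
 * callee frame was, assuming a downward-growing stack. */
NOINLINE long foo_stack_depth(long x) {
    uintptr_t top = (uintptr_t)__builtin_frame_address(0);
    lowest_frame = top;
    foo(x);
    return (long)(top - lowest_frame);
}
```

Printing foo_stack_depth(3) for builds with different flags gives a rough number to compare. Using __builtin_frame_address may itself affect the frame layout of the stubs, so small differences should be treated with care.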