Can GCC or Clang optimize for call stack memory footprint?

Question

I notice that compilers sometimes keep garbage data on the call stack. The call stack consists of stack frames, each of which is the activation record of one function call. Ideally, a function's stack frame should contain only necessary data: spilled callee-saved registers, local variables that must be preserved across nested function calls, the return address, and so on.

Consider a situation where function foo() calls into several other functions. Across these nested function calls, the activation record size of foo() may change. Below is an example:

extern long f(long x);
extern void bar(long x);
extern void tail(void);

void foo(long x) {
    long fx = f(x); // x must be preserved across the f(x) call
                    // because x is later used again.

    bar(x + fx);    // No need to preserve anything. x and fx will
                    // no longer be used again.

    tail();         // Just to prevent tail call optimization on bar(...).
}

However, the code compiled by Clang (version 14.0.4) doesn't optimize its stack frame usage, as shown below. GCC (version 9.4.0) is similar. Optimization -O2 is enabled for both.

foo:
    push   %rbx           // preserve %rbx
    mov    %rdi,%rbx      // %rbx <- %rdi         (%rbx preserves argument x)
    call   f              // %rax <- f(%rdi)
    add    %rax,%rbx      // %rbx <- %rax + %rbx
    mov    %rbx,%rdi      // %rdi <- %rbx         (from now on, %rbx is garbage)
                          //                      (because x will never be used again)
    call   bar            // bar(%rdi)
    pop    %rbx           // restore %rbx         (this should occur earlier)
    jmp    tail           // tail()

Ideally, when the argument x in foo() is no longer useful, we should discard it as soon as possible so that the stack frame memory footprint is kept as small as possible.

foo:
    push   %rbx           // preserve %rbx
    mov    %rdi,%rbx      // %rbx <- %rdi
    call   f              // %rax <- f(%rdi)
    add    %rax,%rbx      // %rbx <- %rax + %rbx
    mov    %rbx,%rdi      // %rdi <- %rbx
    pop    %rbx           // restore %rbx         (pop out 8 bytes from stack)
                          //                      (before calling bar!)
    call   bar            // bar(%rdi)
    jmp    tail           // tail()

So here is my question: is there any compiler option that keeps the stack frame as compact as possible?

In the case shown above, the compiler definitely misses the optimization opportunity. In general, however, keeping the stack frame as compact as possible may introduce extra instructions to manipulate the stack pointer or even data copying inside the stack frame, which poses a trade-off between the call stack memory footprint and the runtime performance.

Having a smaller call stack memory footprint is valuable on embedded systems, where RAM is very limited. On a PC, a smaller memory footprint can lead to better cache locality and thus potentially faster execution.

I'm aware of the -fstack-reuse option in GCC. Its default value is all, so changing it to any other value (named_vars or none) can only make the stack memory footprint worse.
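For context, here is a minimal sketch of the kind of code that -fstack-reuse is about. The names sum_two_phases and BUF_BYTES are made up for illustration. The two arrays live in disjoint scopes, so a compiler is allowed to place them in the same stack slot. Whether it actually does so depends on the compiler and the optimization level, and the resulting frame size is not verified here.

```c
#include <assert.h>
#include <string.h>

enum { BUF_BYTES = 256 };  /* hypothetical buffer size, chosen for illustration */

/* Hypothetical example function: two buffers whose lifetimes never overlap. */
long sum_two_phases(void) {
    long total = 0;
    {
        unsigned char a[BUF_BYTES];  /* lifetime ends at the closing brace */
        memset(a, 1, sizeof a);
        for (size_t i = 0; i < sizeof a; i++) total += a[i];
    }
    {
        unsigned char b[BUF_BYTES];  /* may share a's slot when stack reuse is on */
        memset(b, 2, sizeof b);
        for (size_t i = 0; i < sizeof b; i++) total += b[i];
    }
    /* Each element of a contributes 1 and each element of b contributes 2. */
    assert(total == 3L * BUF_BYTES);
    return total;
}
```

With reuse enabled the frame may need room for only one buffer; with -fstack-reuse=none both buffers would be expected to get separate slots. Note that this concerns slot sharing inside a frame, not shrinking the frame early as in the foo() example.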

Update 1:

Jonathan raised a concern about x being an argument, whose allocation is managed by the caller of foo(). If x were passed on the stack instead, things might be different.

So I have updated the question with a better example, one that needs to preserve an intermediate value across nested function calls.

extern long f(long x);
extern void bar(long x);
extern void tail(void);

void foo(long x) {
    long fx = f(x);

    bar(fx);          // fx must be preserved across this call
                      // because it will be used again later

    long ffx = f(fx); // fx used again here
                      // no need to preserve anything from now on
                      // ideally the stack frame should be
                      // shrunk to 0 bytes before calling f()
    bar(ffx);
    tail();
}

And here is the assembly code generated by Clang (GCC's output is similar):

foo:
    push   %rbx      // preserve %rbx
    call   f         // %rax <- f(%rdi)
    mov    %rax,%rbx // %rbx <- %rax      (fx is preserved in %rbx)
    mov    %rax,%rdi // %rdi <- %rax
    call   bar       // bar(%rdi)
    mov    %rbx,%rdi // %rdi <- %rbx      (use fx again here)
                     //                   (ideally should pop here)
    call   f         // %rax <- f(%rdi)                        ^
    mov    %rax,%rdi // %rdi <- %rax                           |
    call   bar       // bar(%rdi)                              |
    pop    %rbx      // restore %rbx      ---------------------+
    jmp    tail

Update 2:

Unfortunately, -fconserve-stack, -fno-defer-pop and -foptimize-sibling-calls don't help with the examples above.

In my under-informed opinion, it generally wouldn't be sensible to try that (which is probably why the compilers don't do it). The variable `x` is effectively allocated by the caller; the caller will do the cleanup of it. The variable `fx` usually isn't worth worrying about. If it was a multi-kilobyte array and more variables were going to be allocated locally — or in the called functions — it might be worth cleaning up, but I think you're penny-pinching without a clear benefit. But on embedded systems, you probably won't have kilobytes of local arrays. — Jonathan Leffler, Oct 31 '22 at 21:15
Although I generally agree with your sentiments, @JonathanLeffler, if you're pushing the limits of a small system then sometimes small efficiencies can be a make or break proposition. — John Bollinger, Oct 31 '22 at 21:39
Hi @JonathanLeffler, I updated with a better example to get rid of the nuance with function arguments. — Zhiyao, Oct 31 '22 at 23:17

score 1 · Answer 1 · answered Oct 31 '22 at 21:31

GCC has numerous optimization options that affect stack usage, among them -fno-defer-pop, -foptimize-sibling-calls, and several affecting inlining. The one most likely to perform the specific kind of optimization you ask about would be -fconserve-stack, but I cannot say whether that option actually does elicit the specific optimization you're looking for.

I do not find -fconserve-stack documented for Clang, but that doesn't necessarily mean it's not there. I generally find Clang's docs to be rather underwhelming.

John Bollinger

LLVM has [support for that kind of thing](https://llvm.org/docs/LangRef.html#object-lifetime), but it's something I'd be a little surprised to see as an option in Clang. I suggest testing without any debugging symbols, so that the stack frame will have no type information to help the debugger. – arnt Nov 03 '22 at 00:45

score 0 · Answer 2 · answered Nov 05 '22 at 14:57

This proposed code isn't valid for the (x86-64 System V) ABI:

foo:
    push   %rbx           // preserve %rbx
    mov    %rdi,%rbx      // %rbx <- %rdi
    call   f              // %rax <- f(%rdi)
    add    %rax,%rbx      // %rbx <- %rax + %rbx
    mov    %rbx,%rdi      // %rdi <- %rbx
    pop    %rbx           // restore %rbx         (pop out 8 bytes from stack)
                          //                      (before calling bar!)
    call   bar            // bar(%rdi)
    jmp    tail           // tail()

Stack alignment is modulo 16: %rsp must be a multiple of 16 immediately before each call instruction, so %rsp+8 is a multiple of 16 on function entry. Because the call instruction pushes an 8-byte return address, any function that makes a call of its own has to add at least 8 more bytes to the stack to restore that alignment.

Therefore, if you pop %rbx before call bar, you would still need to subtract 8 bytes from %rsp before call bar and add them back before jmp tail, so the footprint during bar would be no smaller.
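The alignment arithmetic can be checked with a small model. This is a sketch in C, not compiler output: it tracks only %rsp modulo 16, and every name in it (after_push, after_pop, call_allowed, and the two *_is_abi_valid functions) is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum { STACK_ALIGN = 16, SLOT_BYTES = 8 };

/* push, or sub $8,%rsp: the stack grows downward by one 8-byte slot. */
unsigned after_push(unsigned rsp_mod16) {
    assert(rsp_mod16 < STACK_ALIGN);
    return (rsp_mod16 + STACK_ALIGN - SLOT_BYTES) % STACK_ALIGN;
}

/* pop, or add $8,%rsp: the stack shrinks by one 8-byte slot. */
unsigned after_pop(unsigned rsp_mod16) {
    assert(rsp_mod16 < STACK_ALIGN);
    return (rsp_mod16 + SLOT_BYTES) % STACK_ALIGN;
}

/* %rsp has to be a multiple of 16 right before a call instruction. */
bool call_allowed(unsigned rsp_mod16) {
    return rsp_mod16 == 0;
}

/* The sequence proposed in the question: pop %rbx before call bar. */
bool proposed_foo_is_abi_valid(void) {
    unsigned rsp = 8;                 /* on entry, %rsp+8 is a multiple of 16 */
    rsp = after_push(rsp);            /* push %rbx: now 0 */
    bool ok_f = call_allowed(rsp);    /* call f: aligned */
    rsp = after_pop(rsp);             /* pop %rbx: now 8 */
    bool ok_bar = call_allowed(rsp);  /* call bar: misaligned */
    return ok_f && ok_bar;
}

/* Same sequence with 8 bytes of padding re-added around call bar. */
bool padded_foo_is_abi_valid(void) {
    unsigned rsp = 8;
    rsp = after_push(rsp);            /* push %rbx */
    bool ok_f = call_allowed(rsp);    /* call f */
    rsp = after_pop(rsp);             /* pop %rbx */
    rsp = after_push(rsp);            /* sub $8,%rsp: padding */
    bool ok_bar = call_allowed(rsp);  /* call bar: aligned again */
    rsp = after_pop(rsp);             /* add $8,%rsp */
    /* jmp tail has to see the same alignment foo saw on entry. */
    return ok_f && ok_bar && rsp == 8;
}
```

In this model proposed_foo_is_abi_valid() returns false and padded_foo_is_abi_valid() returns true, and the padded variant occupies the same 8 bytes during bar that the saved %rbx occupied in the compiler's output.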
