Consider the following toy example that allocates memory on the stack by means of the alloca() function:
#include <alloca.h>
void foo() {
    volatile int *p = alloca(4);
    *p = 7;
}
Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:
foo:
    pushq   %rbp
    movq    %rsp, %rbp
    subq    $16, %rsp
    leaq    15(%rsp), %rax
    andq    $-16, %rax
    movl    $7, (%rax)
    leave
    ret
Honestly, I would have expected more compact assembly code.
16-byte alignment for allocated memory
The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
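To make the arithmetic concrete, here is a minimal C sketch of what I understand the leaq 15(%rsp), %rax / andq $-16, %rax pair to compute. The function name align_up_16 is my own, chosen for illustration, and the sketch ignores wrap-around near the top of the address space:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (name is mine): round addr up to the next multiple
   of 16, or return addr unchanged if it is already 16-byte aligned.
   Mirrors: leaq 15(%rsp), %rax ; andq $-16, %rax */
uintptr_t align_up_16(uintptr_t addr) {
    /* -16 in two's complement has its low four bits clear, so the AND
       drops the low four bits of addr + 15. */
    uintptr_t aligned = (addr + 15) & ~(uintptr_t)15;
    /* The result is the only multiple of 16 in [addr, addr + 15]
       (assuming addr + 15 does not overflow). */
    assert(aligned % 16 == 0 && aligned - addr <= 15);
    return aligned;
}
```

For instance, an input of 0x1001 would be rounded up to 0x1010, while 0x1010 would be returned unchanged.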
This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?
Possible missed optimization?
Suppose, anyway, that we do want the memory allocated by alloca() to be 16-byte aligned. Even so, keep in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., at call foo). Now look at the state of the stack inside foo() just after pushing the rbp register:
Size          Stack              RSP mod 16     Description
-----------------------------------------------------------------------------------
        ------------------
        |       .        |
        |       .        |
        |       .        |
        ------------------........0             at "call foo" (stack 16-byte aligned)
8 bytes | return address |
        ------------------........8             at foo entry
8 bytes |   saved RBP    |
        ------------------........0        <-----  RSP is 16-byte aligned!!!
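The same bookkeeping can be written as a small C sketch. This is only a model of the stack pointer arithmetic, under the assumption stated above that rsp is 16-byte aligned at the call site; the function name rsp_after_prologue is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model (name is mine): given RSP just before "call foo",
   return RSP after "call foo" followed by "pushq %rbp".
   Each of the two steps pushes 8 bytes on x86-64. */
uintptr_t rsp_after_prologue(uintptr_t rsp_at_call) {
    /* Assumption: the stack is 16-byte aligned at the call site. */
    assert(rsp_at_call % 16 == 0);
    uintptr_t rsp = rsp_at_call;
    rsp -= 8;   /* call pushes the return address: RSP mod 16 == 8 */
    rsp -= 8;   /* pushq %rbp saves the old frame pointer: RSP mod 16 == 0 */
    return rsp;
}
```

Under that assumption, the value returned is always a multiple of 16, which matches the last row of the diagram.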
I think that, by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte-aligned address, the following code could be used instead:
foo:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $7, -16(%rbp)
    leave
    ret
The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.
Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:
foo:
    movl    $7, -8(%rsp)
    ret
Is this just a missed optimization, or am I missing something else here?