Understanding GCC's alloca() alignment and seemingly missed optimization

Question

Consider the following toy example that allocates memory on the stack by means of the alloca() function:

#include <alloca.h>

void foo() {
    volatile int *p = alloca(4);
    *p = 7;
}

Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:

foo:
   pushq   %rbp
   movq    %rsp, %rbp
   subq    $16, %rsp
   leaq    15(%rsp), %rax
   andq    $-16, %rax
   movl    $7, (%rax)
   leave
   ret

Honestly, I would have expected more compact assembly code.

16-byte alignment for allocated memory

The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
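To make the rounding concrete, here is a small C sketch of what the leaq 15(%rsp), %rax / andq $-16, %rax pair computes. The helper name align_up_16 is hypothetical and exists only for illustration; it is not something GCC emits or exposes.

```c
#include <stdint.h>

/* Round addr up to the next multiple of 16.
   Adding 15 mirrors "leaq 15(%rsp), %rax"; masking with ~15
   (which is the same bit pattern as -16) mirrors "andq $-16, %rax". */
uintptr_t align_up_16(uintptr_t addr)
{
    return (addr + 15) & ~(uintptr_t)15;
}
```

An address that is already a multiple of 16 is returned unchanged; any other address moves up to the next 16-byte boundary, which is why the result always lies between rsp and rsp + 15.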

This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?

Possible missed optimization?

Let's assume anyway that we want the memory allocated by alloca() to be 16-byte aligned. GCC assumes the stack is aligned to a 16-byte boundary at the moment the function call is performed (i.e., at call foo). With that in mind, look at the state of the stack inside foo() just after pushing the rbp register:

Size          Stack          RSP mod 16      Description
-----------------------------------------------------------------------------------
        ------------------
        |       .        |
        |       .        | 
        |       .        |            
        ------------------........0          at "call foo" (stack 16-byte aligned)
8 bytes | return address |
        ------------------........8          at foo entry
8 bytes |   saved RBP    |
        ------------------........0  <-----  RSP is 16-byte aligned!!!

I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:

foo:
   pushq   %rbp
   movq    %rsp, %rbp
   movl    $7, -16(%rbp)
   leave
   ret

The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.

Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:

foo:
   movl    $7, -8(%rsp)
   ret

Is this just a missed optimization, or am I missing something else here?

Running on macOS? The macOS ABI requires 16 bytes stack alignment... — Macmade, Sep 26 '18 at 20:59
@Macmade: That requirement applies before a `call`. There's no requirement that functions keep RSP 16-byte aligned *at all times*. If gcc has to adjust RSP for anything, it will make it 16-byte aligned, but if it can just use the red-zone for locals it will leave RSP untouched (other than possible push/pop). — Peter Cordes, Sep 26 '18 at 21:03

score 6 · Answer 1 · edited Sep 27 '18 at 18:14

This is (partially) a missed optimization in gcc. Clang does it as expected.

I say partially because, if you know you will be using gcc, you can use its builtin functions (use conditional compilation to keep the code portable between gcc and other compilers).

__builtin_alloca_with_align is your friend ;)

Here is an example (changed so that the compiler will not reduce each function to a single ret):

#include <alloca.h>

volatile int* p;

void foo() 
{
    p = alloca(4) ;
    *p = 7;
}

void zoo() 
{
    // alignment argument is in bits, not bytes (16 bits = 2 bytes)
    p = __builtin_alloca_with_align(4,16) ;
    *p = 7;
}

int main()
{
  foo();
  zoo();
}

Disassembled code (with objdump -d -w --insn-width=12 -M intel)

Clang will produce the following code (clang -O3 test.c) - both functions look alike

0000000000400480 <foo>:
  400480:       48 8d 44 24 f8                          lea    rax,[rsp-0x8]
  400485:       48 89 05 a4 0b 20 00                    mov    QWORD PTR [rip+0x200ba4],rax        # 601030 <p>
  40048c:       c7 44 24 f8 07 00 00 00                 mov    DWORD PTR [rsp-0x8],0x7
  400494:       c3                                      ret    

00000000004004a0 <zoo>:
  4004a0:       48 8d 44 24 fc                          lea    rax,[rsp-0x4]
  4004a5:       48 89 05 84 0b 20 00                    mov    QWORD PTR [rip+0x200b84],rax        # 601030 <p>
  4004ac:       c7 44 24 fc 07 00 00 00                 mov    DWORD PTR [rsp-0x4],0x7
  4004b4:       c3                                      ret

GCC produces this one (gcc -g -O3 -fno-stack-protector)

0000000000000620 <foo>:
 620:   55                                      push   rbp
 621:   48 89 e5                                mov    rbp,rsp
 624:   48 83 ec 20                             sub    rsp,0x20
 628:   48 8d 44 24 0f                          lea    rax,[rsp+0xf]
 62d:   48 83 e0 f0                             and    rax,0xfffffffffffffff0
 631:   48 89 05 e0 09 20 00                    mov    QWORD PTR [rip+0x2009e0],rax        # 201018 <p>
 638:   c7 00 07 00 00 00                       mov    DWORD PTR [rax],0x7
 63e:   c9                                      leave  
 63f:   c3                                      ret    

0000000000000640 <zoo>:
 640:   48 8d 44 24 fc                          lea    rax,[rsp-0x4]
 645:   c7 44 24 fc 07 00 00 00                 mov    DWORD PTR [rsp-0x4],0x7
 64d:   48 89 05 c4 09 20 00                    mov    QWORD PTR [rip+0x2009c4],rax        # 201018 <p>
 654:   c3                                      ret

As you can see, zoo now looks as expected and is similar to the clang code.
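The conditional compilation mentioned earlier could look roughly like the sketch below. STACK_ALLOC_INT and store_and_load are hypothetical names introduced here for illustration. The sketch assumes a platform that provides <alloca.h>, and it requests the alignment of int (in bits) rather than the 16 bits used in the example above.

```c
#include <alloca.h>
#include <limits.h>

/* STACK_ALLOC_INT is a hypothetical macro, not part of any library. */
#if defined(__GNUC__) && !defined(__clang__)
/* GCC: ask only for int alignment. The second argument is in bits
   and must be a constant power of two. */
#define STACK_ALLOC_INT() \
    ((int *)__builtin_alloca_with_align(sizeof(int), CHAR_BIT * _Alignof(int)))
#else
/* Other compilers (Clang defines __GNUC__ too, so it is excluded
   explicitly): fall back to plain alloca. */
#define STACK_ALLOC_INT() ((int *)alloca(sizeof(int)))
#endif

/* Store a value through the stack allocation and read it back. */
int store_and_load(int value)
{
    volatile int *p = STACK_ALLOC_INT();
    *p = value;
    return *p;
}
```

The allocation is made through a macro rather than a helper function because memory obtained from alloca is released when the calling function returns, so it has to be requested in the function that uses it.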

Peter Cordes · Accepted Answer · 2018-09-28T02:20:47.933

The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.

It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)

A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.

So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.

Also alignof(max_align_t) == 16, but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.
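The point about alignment never exceeding size can be checked with a short C11 sketch. FITS and standard_alignments_fit_sizes are hypothetical names used only for this illustration, and the list of types is a sample rather than an exhaustive one.

```c
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical helper macro: true when a type's alignment
   requirement is no wider than its size. */
#define FITS(T) (alignof(T) <= sizeof(T))

/* Returns 1 when every sampled standard type satisfies FITS. */
int standard_alignments_fit_sizes(void)
{
    return FITS(char) && FITS(short) && FITS(int) && FITS(long) &&
           FITS(long long) && FITS(float) && FITS(double) &&
           FITS(long double) && FITS(void *) && FITS(max_align_t);
}
```

Since no sampled type needs more alignment than its own size, a 4-byte allocation only ever has to be suitably aligned for objects of 4 bytes or fewer.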

You're right, it should be able to optimize it to this:

void foo() {
    alignas(16) int dummy[1];
    volatile int *p = dummy;   // alloca(4)
    *p = 7;
}

and compile it to the movl $7, -8(%rsp) ; ret you suggested.

The alignas(16) might be optional here for alloca.

If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.
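A minimal sketch of the VLA alternative, assuming C99 or GNU C. The function name store_seven_vla is hypothetical. Whether GCC folds the array into a fixed-size local after inlining with a constant argument depends on the compiler version and optimization level, so check the generated asm rather than assuming it.

```c
#include <stddef.h>

/* Hypothetical example: same store as foo(), but using a VLA.
   n_ints must be greater than zero; a zero-length VLA is undefined. */
int store_seven_vla(size_t n_ints)
{
    volatile int buf[n_ints];   /* variable-length array on the stack */
    buf[0] = 7;
    return buf[0];
}
```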

Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.
