
I've written a short PE32+ program in FASM that writes "Hello world!" to stdout and exits.

format PE64 console
include 'win64wx.inc'
.code
  start:
    invoke WriteFile,<invoke GetStdHandle,STD_OUTPUT_HANDLE>,hello,hello.length,dummy,0
    invoke ExitProcess,0
  .end start
.data
  dummy rd 1  
  hello db "Hello world!",13,10,0
  hello.length = $ - hello

I've looked at the generated machine code, but I cannot understand why RSP is manipulated the way it is. This is the disassembly:

sub rsp,byte +0x08         ;Allocate 8 bytes on the stack. 
sub rsp,byte +0x30         ;Allocate shadow space for WriteFile (48 bytes)
sub rsp,byte +0x20         ;Allocate shadow space for GetStdHandle
mov rcx,0xfffffffffffffff5 ;Set the constant for stdout
call [rel 0x1060]          ;Call GetStdHandle. The handle for stdout is now in RAX
add rsp,byte +0x20         ;Deallocate shadow space for GetStdHandle
mov rcx,rax                ;Set stdout handle: hFile
mov rdx,0x403004           ;Set the pointer to string "Hello World!\r\n": lpBuffer
mov r8,0xf                 ;Set the length of the string: nNumberOfBytesToWrite
mov r9,0x403000            ;Set the pointer for lpNumberOfBytesWritten
mov qword [rsp+0x20],0x0   ;Push a 64 bit NULL pointer onto the stack: lpOverlapped
call [rel 0x1068]          ;Call WriteFile
add rsp,byte +0x30         ;Deallocate shadow space for WriteFile
sub rsp,byte +0x20         ;Allocate shadow space for ExitProcess
mov rcx,0x0                ;Set the return value
call [rel 0x1058]          ;Call ExitProcess
add rsp,byte +0x20         ;Deallocate shadow space for ExitProcess

I understand it doesn't really matter that the space for WriteFile is allocated well in advance, but why is it sub rsp,byte +0x30 and not sub rsp,byte +0x28? And why is the first sub rsp,byte +0x08 there? Is it FASM's idiosyncrasy or am I fundamentally misunderstanding Microsoft x64 stack management rules?

Peter Cordes
Alexey
  • Yeah this code is terrible. Can you add the source code that generated it? It doesn’t need separate shadow space for each function called. And it’s crazy to adjust the stack pointer multiple times. It should just do `sub rsp, 0x28` once at the beginning of the function. – prl Dec 17 '20 at 13:07
  • The way this code is formatted makes it very difficult to read. Can you please get rid of all the extra lines? – prl Dec 17 '20 at 13:08
  • There is one rule that you seem to be unaware of. The stack has to be 16-byte aligned. That’s probably what the first sub 8 is for. Then all the subsequent adjustments are multiples of 16. – prl Dec 17 '20 at 13:11
  • @prl I've added the FASM source and cleaned up the disassembly. Why would the stack not be aligned when the program starts? – Alexey Dec 17 '20 at 20:45
  • The stack is aligned when the program starts. Each function call pushes an 8-byte return address on the stack. So each called function must realign the stack to a multiple of 16, by subtracting an odd multiple of 8. – prl Dec 17 '20 at 20:55
  • Thanks, it’s clear now that each invoke call has its own stack adjustment, which is why there are so many of them. – prl Dec 17 '20 at 20:57
  • Moral of this story: don't use `invoke` in 64-bit code. The only way for the code to be efficient is to reserve shadow space once and share it across multiple calls, but an `invoke` macro or built-in statement can't assume anything about the surrounding code. If you want to write high-level code, use a C compiler. MASM chose differently from FASM: they [dropped support for `invoke` from 64-bit](https://stackoverflow.com/questions/65279900/how-does-32-bit-masm-mode-differ-from-64-bit#comment115431977_65280053), along with their `.if` crap (again, use a C compiler if you want "high level" code). – Peter Cordes Dec 18 '20 at 03:18

1 Answer


The comments already covered the 16-byte stack alignment: the call instruction pushes an 8-byte return address, which leaves the stack misaligned on function entry, so you have to realign it (for example with sub rsp, 8) in the function prolog; see https://learn.microsoft.com/en-us/cpp/build/stack-usage. You also mentioned the 32-byte (0x20) shadow store, which the callee can use to spill the first 4 parameters, which are passed in registers; see https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention.
The caller is responsible for allocating the shadow store, but a single allocation can serve multiple function calls. FASM can do this with the frame/endf macros; see https://flatassembler.net/docs.php?article=win32#1.4. If you call GetStdHandle explicitly instead of nesting it inside the WriteFile invoke (I'm not sure why this is required), the allocations are combined:

format PE64 console
include 'win64wx.inc'
.code
  start:
   frame
    invoke GetStdHandle,STD_OUTPUT_HANDLE
    invoke WriteFile,rax,hello,hello.length,dummy,0
    invoke ExitProcess,0
   endf
  .end start
.data
  dummy rd 1  
  hello db "Hello world!",13,10,0
  hello.length = $ - hello

Which assembles to:

sub rsp,0x8
sub rsp,0x30
mov rcx,0xFFFFFFFFFFFFFFF5
call qword ptr ds:[<&GetStdHandle>]
mov rcx,rax
mov rdx,test.403004
mov r8,0xF
mov r9,test.403000
mov qword ptr ss:[rsp+0x20],0x0
call qword ptr ds:[<&WriteFile>]
mov rcx,0x0
call qword ptr ds:[<&FatalExit>]
add rsp,0x30

The sub rsp, 0x20 is gone now, along with the add/sub pairs between the calls. Unfortunately, the sub rsp, 8 is still there (it's inserted by the .code macro), but the result is much cleaner. If you use the proc macro, then its push rbp will already align the stack, so you don't need an extra sub rsp, 8.
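To make the alignment arithmetic concrete, here is a small sketch in Python (not FASM, purely for illustration) that tracks RSP modulo 16 through the stack adjustments shown above. The helper names are made up for this example. It assumes RSP is 8 mod 16 at the entry point, as described in the comments, and it assumes the invoke macro rounds each argument area up to a multiple of 16; that second assumption is inferred from the disassembly (0x30 for WriteFile's five arguments), not taken from the FASM documentation.

```python
# Sketch of the RSP arithmetic. All names here are hypothetical helpers.
# Assumption: RSP is 8 mod 16 on entry, because the call that reached
# `start` pushed an 8-byte return address onto a 16-byte-aligned stack.

def arg_area(num_args):
    """Bytes the caller reserves: 8 per argument, never less than the 4 shadow slots."""
    return 8 * max(num_args, 4)

def round_up_16(n):
    """Round n up to the next multiple of 16."""
    return (n + 15) & ~15

def trace(adjustments, entry_mod=8):
    """Return RSP mod 16 after each 'sub rsp, n' in the given sequence."""
    rsp = entry_mod
    seen = []
    for n in adjustments:
        rsp = (rsp - n) % 16
        seen.append(rsp)
    return seen

# WriteFile takes 5 arguments: 4 shadow slots plus one stack slot at [rsp+0x20].
print(hex(arg_area(5)), hex(round_up_16(arg_area(5))))  # 0x28 0x30

# Original disassembly: sub 0x08, then sub 0x30, then sub 0x20.
print(trace([0x08, 0x30, 0x20]))  # [0, 0, 0]

# The single adjustment suggested in the comments: sub rsp, 0x28.
print(trace([0x28]))  # [0]
```

Under these assumptions, both sequences leave RSP 16-byte aligned at every call. The macro version reaches that state in two steps (0x08 to align, then a size rounded up to 0x30), while a hand-written prologue can fold the alignment byte count and the 0x28 argument area into one subtraction.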

Of course, this is FASM, so if you don't use proc you can add or redefine some macros to combine all allocations, and maybe even use the shadow store to spill the nonvolatile registers you use, and leave the stack unaligned when you don't call any functions. This is what I did in https://github.com/stevenwdv/asmsequent/blob/main/proc_mod.inc (with frame/save/rest(ore)/... macros). It may be a bit ugly, and I no longer remember exactly how most of it works, but it got the job done. E.g. function solve assembles to:

mov qword ptr ss:[rsp+0x8],r12
mov qword ptr ss:[rsp+0x10],r13
mov qword ptr ss:[rsp+0x18],r14
mov qword ptr ss:[rsp+0x20],r15
push rsi
push rbx
sub rsp,0x28
...
call ...
...
call ...
...
add rsp,0x28
pop rbx
pop rsi
mov r15,qword ptr ss:[rsp+0x20]
mov r14,qword ptr ss:[rsp+0x18]
mov r13,qword ptr ss:[rsp+0x10]
mov r12,qword ptr ss:[rsp+0x8]
ret

Note the combined allocations (and the usage of the shadow store where possible).
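The numbers in that prologue can be checked with a short sketch, again in Python and again with made-up helper names. It assumes RSP is 8 mod 16 on function entry and that the functions called from solve take at most five arguments; the listing elides the calls, so the argument count is an assumption.

```python
# Sketch only: hypothetical helpers for checking a Win64 prologue by hand.

def home_slot_offsets():
    """Offsets from RSP, on function entry, of the 4 shadow slots owned by the callee.
    [rsp] holds the return address, so the slots start at rsp+8."""
    return [8 * (i + 1) for i in range(4)]

def local_area(num_pushes, max_call_args, entry_mod=8):
    """Smallest 'sub rsp, n' that reserves the outgoing argument area
    and leaves RSP 16-byte aligned for the calls in the body."""
    need = 8 * max(max_call_args, 4)                  # shadow space plus stack args
    rsp_after_pushes = (entry_mod - 8 * num_pushes) % 16
    pad = (rsp_after_pushes - need) % 16              # extra bytes needed to realign
    return need + pad

# r12..r15 are saved at [rsp+0x8], [rsp+0x10], [rsp+0x18], [rsp+0x20].
print([hex(o) for o in home_slot_offsets()])  # ['0x8', '0x10', '0x18', '0x20']

# solve: push rsi, push rbx, then sub rsp, 0x28.
print(hex(local_area(num_pushes=2, max_call_args=4)))  # 0x28

# A proc-style prologue with a single push rbp needs no padding.
print(hex(local_area(num_pushes=1, max_call_args=4)))  # 0x20
```

The four saved registers land exactly in the shadow slots that the caller of solve already reserved, which is why they cost no extra stack space, and the single push in the last case is the alignment effect of push rbp mentioned earlier.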

SWdV