I've figured it out. I was indeed pushing onto the stack wrong. But I had a fundamental misunderstanding of how the stack worked which the Microsoft docs did a horrible job of explaining.
What I Did Wrong
Attempt #1
As @RbMm pointed out in the comments, the arguments are expected to be at RSP+20h, RSP+28h, and RSP+30h respectively. In addition, there needs to be shadow space on the stack for the function call. I made a series of mistakes that kept this from working.
Let's explain the way I did the code previously:
LEA RCX, TextTestfilePath
MOV RDX, 80000000h ; GENERIC_READ
MOV R8, 00000001h ; FILE_SHARE_READ
MOV R9, 0h ; NULL
SUB RSP, 20h
PUSH 00h
PUSH 80h
PUSH 3
CALL CreateFileW
ADD RSP, 20h
I was modifying the stack pointer to allocate the shadow space. This is correct, and 20h is the right value because the shadow space is 32 bytes, which is 20h in hexadecimal. This keeps everything 16-byte aligned.
I was pushing the arguments onto the stack. The problem is, I was doing this incorrectly (or backwards). RSP, the stack pointer, references the top of the stack. Each PUSH subtracts 8 from RSP and stores the value at the new top, so the three values ended up at RSP, RSP+8h, and RSP+10h, where the shadow space is supposed to be, instead of at RSP+20h and beyond. To top this off, the three pushes moved the stack pointer by 18h, so it was no longer 16-byte aligned. The stack pointer is expected to move only in 16-byte-aligned steps such as 20h or 40h, and not be modified by a PUSH instruction before the call.
After the pushes, with the values in the wrong position and the pointer in the wrong spot, the call would fail entirely.
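To make the arithmetic concrete, here is a small Python model of Attempt #1. It is only a sketch: the starting RSP value of 0x1000 is a made-up, 16-byte-aligned address chosen for illustration, and the `push` helper is a hypothetical stand-in for what the PUSH instruction does to RSP and memory.

```python
def push(rsp, memory, value):
    """Model of PUSH: subtract 8 from RSP, then store at the new RSP."""
    rsp -= 8
    memory[rsp] = value
    return rsp

rsp = 0x1000          # hypothetical starting RSP, assumed 16-byte aligned
memory = {}           # address -> stored value

rsp -= 0x20           # SUB RSP, 20h
for value in (0x00, 0x80, 3):   # PUSH 00h / PUSH 80h / PUSH 3
    rsp = push(rsp, memory, value)

# Where each value sits relative to the final RSP.
offsets = {hex(addr - rsp): value for addr, value in sorted(memory.items())}
print(offsets)        # {'0x0': 3, '0x8': 128, '0x10': 0}
print(rsp % 16)       # 8, so RSP is no longer 16-byte aligned
```

The values land at offsets 0h, 8h, and 10h from RSP, which is exactly the region the callee treats as its shadow space, and nothing is stored at 20h, 28h, or 30h.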
Attempt #2
So, I attempted to correct for these mistakes by doing the following. However, I made a fatal mistake again in this process:
LEA RCX, TextTestfilePath
MOV RDX, 80000000h ; GENERIC_READ
MOV R8, 00000001h ; FILE_SHARE_READ
MOV R9, 0h ; NULL
MOV [RSP + 20h], 3
MOV [RSP + 28h], 80h
MOV [RSP + 36h], 00h
SUB RSP, 20h
CALL CreateFileW
ADD RSP, 20h
There are three mistakes here, and these should be more obvious.
I was storing the values relative to the stack pointer before moving it. Then I moved the stack pointer by 20h, taking it completely away from the arguments I had just stored: after the SUB they sit 20h bytes further from RSP than the callee expects.
In 20h, 28h, and 36h, I was doing the math wrong. I was adding 8 in decimal (20+8=28, 28+8=36), when I should have been adding 8 in hexadecimal (20h+8h=28h, and 28h+8h=30h, not 36h).
The assembler cannot tell the operand size from [RSP+28h] alone when the source is an immediate value, so I had to specify the size of the value being stored through the pointer by adding QWORD PTR before it. (Notably, I am on x64, where each stack slot is 8 bytes, so I used QWORD rather than the DWORD that almost all of the MASM examples out there use.)
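The offset arithmetic is easy to check outside the assembler. The Python sketch below assumes the layout described in this answer (20h of shadow space followed by one 8-byte slot per stack argument); the constant names are my own.

```python
SHADOW_SPACE = 0x20   # 32 bytes reserved for the callee
SLOT_SIZE = 8         # each stack argument occupies one 8-byte slot

# Offsets of the 5th, 6th, and 7th arguments relative to RSP at the CALL.
offsets = [SHADOW_SPACE + SLOT_SIZE * i for i in range(3)]
print([f"{o:X}h" for o in offsets])   # ['20h', '28h', '30h']

# How far off the mistaken 36h was from the correct third offset.
print(0x36 - offsets[2])              # 6
```

An offset of 36h is 6 bytes past the correct slot, so the store would not even have been 8-byte aligned.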
Attempt #3
After I resolved these problems, my code resulted in the following:
LEA RCX, TextTestfilePath
MOV RDX, 80000000h ; GENERIC_READ
MOV R8, 00000001h ; FILE_SHARE_READ
MOV R9, 0h ; NULL
SUB RSP, 20h
MOV QWORD PTR [RSP + 20h], 3
MOV QWORD PTR [RSP + 28h], 80h
MOV QWORD PTR [RSP + 30h], 00h
CALL CreateFileW
ADD RSP, 20h
This code does the following:
It moves the first four arguments into the registers, as before.
It moves the stack pointer (which, as explained before, is the top of the stack) down by 20h, which opens up 32 bytes for the shadow space and keeps the 16-byte alignment. Important to note is that your arguments do not go in this space: the 32 bytes we just opened up belong to the function being called, so we must not overwrite them.
It stores the arguments relative to our newly modified stack pointer, but offsets them by 20h to avoid overwriting the shadow space.
And yes, if you're seeing what I am seeing, this code is actually the same thing as doing this:
MOV QWORD PTR [RSP], 3
MOV QWORD PTR [RSP + 8h], 80h
MOV QWORD PTR [RSP + 10h], 00h
SUB RSP, 20h
This does the exact same thing, but it puts the arguments onto the stack before allocating the shadow space.
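If the equivalence is not obvious, it can be checked with plain address arithmetic. This Python sketch uses a made-up RSP value of 0x1000 purely for illustration; any starting value gives the same result.

```python
old_rsp = 0x1000              # hypothetical RSP before SUB RSP, 20h
new_rsp = old_rsp - 0x20      # RSP after SUB RSP, 20h

# Addresses written when storing first, then subtracting.
store_then_sub = [old_rsp + off for off in (0x00, 0x08, 0x10)]
# Addresses written when subtracting first, then storing at +20h and up.
sub_then_store = [new_rsp + off for off in (0x20, 0x28, 0x30)]

print(store_then_sub == sub_then_store)   # True
```

Both orderings write to the same three addresses; only the offsets in the source differ.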
I prefer the syntax of +20h to account for the shadow space, as it makes it more obvious for me that we are taking it into account. But what I want you to get out of this, is that the documentation for the stack is terrible.
Attempt #4
As @RaymondChen pointed out in the comments, I was not taking into account the prolog and epilog for my function. RSP should not be modified inside the body of a function, and the other nonvolatile registers (RBX, RBP, RDI, RSI, and R12 through R15) must be saved in the prolog and restored in the epilog if the function changes them. This is the purpose of the prolog and epilog, alongside letting the stack be unwound for debugging when an exception occurs.
The updated function call does essentially the same thing as before, but does not modify the stack pointer:
LEA RCX, TextTestfilePath
MOV RDX, 80000000h ; GENERIC_READ
MOV R8, 00000001h ; FILE_SHARE_READ
MOV R9, 0h ; NULL
MOV QWORD PTR [RSP + 20h], 3
MOV QWORD PTR [RSP + 28h], 80h
MOV QWORD PTR [RSP + 30h], 00h
CALL CreateFileW
I've updated the "standard" below.
The x64 Stack Usage Standard (in better terms)
Here is the actual x64 stack usage standard that you need to follow when calling a Win32 function in MASM x64:
- At the beginning of your function (including main), set up a prolog.
- In this prolog, allocate the 20h of shadow space for function calls by subtracting 32 bytes (20h) from the stack pointer, in addition to space for other local variables and stack arguments. An example is given below.
- Assign your first four arguments to RCX, RDX, R8, and R9 for ARG1, ARG2, ARG3, and ARG4 respectively.
- Store your remaining arguments on the stack without modifying the stack pointer, past the 20h of reserved space (that is, MOV QWORD PTR [RSP+20h], ARG5, then MOV QWORD PTR [RSP+28h], ARG6, then MOV QWORD PTR [RSP+30h], ARG7, and so on). Do not use PUSH for this.
- CALL your Win32 function.
- At the end of your function and after all calls are completed (including main), set up an epilog.
- In this epilog is where you restore the stack to the original pointer prior to the function call. You'll add the same value you subtracted at the beginning of the function.
An example of a proper Win32 function call is shown below:
INCLUDELIB kernel32.lib
.CODE
main PROC
LOCAL LocalVariable: QWORD
; Prolog
PUSH RBP ; Store the RBP to restore it after
MOV RBP, RSP ; Move the RSP into RBP for debugging
SUB RSP, 40h ; 20h of shadow space for function calls
; 8h for the one local QWORD variable
; 18h for 3 stack arguments
MOV RCX, ARG1 ; Put ARG1 into RCX
MOV RDX, ARG2 ; Put ARG2 into RDX
MOV R8, ARG3 ; Put ARG3 into R8
MOV R9, ARG4 ; Put ARG4 into R9
MOV QWORD PTR [RSP + 20h], ARG5 ; Put ARG5 into RSP+20h
MOV QWORD PTR [RSP + 28h], ARG6 ; Put ARG6 into RSP+28h
MOV QWORD PTR [RSP + 30h], ARG7 ; Put ARG7 into RSP+30h
CALL MyWin32Function
; Technically, you don't need an epilog if your next
; call is going to end the process. I provide it
; as an example.
; Epilog
ADD RSP, 40h ; Same value as prolog
MOV RSP, RBP ; Restore original stack pointer
POP RBP ; Restore original RBP
RET
main ENDP
END
This ensures that when you store your arguments (at RSP+20h and up), they are still within the stack frame your prolog allocated (RSP to RSP+40h).
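As a rough check of where the 40h comes from, the Python sketch below adds up the pieces and rounds to a multiple of 16. The `frame_size` helper is hypothetical, and it assumes RSP is already 16-byte aligned right after PUSH RBP, so any multiple of 16 keeps it aligned.

```python
def frame_size(local_bytes, stack_args, shadow=0x20, slot=8, align=16):
    """Bytes to subtract from RSP: shadow space + locals + stack arguments,
    rounded up to the next multiple of the alignment."""
    raw = shadow + local_bytes + stack_args * slot
    return (raw + align - 1) // align * align

# One QWORD local and three stack arguments, as in the example above.
print(hex(frame_size(8, 3)))   # 0x40

# With no locals, 20h + 18h = 38h is rounded up to keep RSP aligned.
print(hex(frame_size(0, 3)))   # 0x40
```

In the example the pieces happen to add up to exactly 40h; when they don't, the extra padding is what keeps RSP 16-byte aligned at the CALL.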
You must also write this prolog and epilog for any functions you develop yourself. This avoids needing to allocate the 20h of stack space on every function call, and correctly handles Win32 exception handling for the __fastcall convention so that it (and you) can 'walk the stack.'
Hopefully this helps someone understand this a little better.
I am not sure why the standards express things in terms of right to left, or front to back, or top to bottom, because this explanation is unintuitive and subjective depending on how you are viewing the stack. Using terms like ADD or SUBTRACT makes much more sense and is universal no matter the way the stack is being displayed.
I hope that this helps someone avoid the 6-7 hours of research and pain that I went through, and helps explain the stack much better! If anyone has any comments regarding my explanation as to things I may have overlooked or explained incorrectly, please let me know. However, so far this has worked for me 100% of the time.