Is this inline-asm approach for stack switching ok?

Question

For some functions, I need to switch the stack so that the original stack remains unmodified. For that purpose, I have written two macros as shown below.

#define SAVE_STACK()    __asm__ __volatile__ ( "mov %%rsp, %0; mov %1, %%rsp" : \
"=m" (saved_sp) : "m" (temp_sp) );
/* saved_sp is only read here, so it must be an input operand, not an output */
#define RESTORE_STACK() __asm__ __volatile__ ( "mov %0, %%rsp" : \
: "m" (saved_sp) );

Here temp_sp and saved_sp are thread-local variables; temp_sp points to the makeshift stack that we use. In a function whose original stack I want to keep unmodified, I place SAVE_STACK at the beginning and RESTORE_STACK at the end, for example:

int some_func(int param1, int param2)
{
 int a, b, r;
 SAVE_STACK();
 // Function Body here
 .....................
 RESTORE_STACK();
 return r;
}

Now my question is whether this approach is fine. On x86 (64-bit), local variables and parameters are accessed through the rbp register; rsp is decremented in the function prologue and not touched again until the function epilogue, where it is incremented back to its original value. Therefore, I see no problem here.

I am not sure whether this is correct in the presence of context switches and signals, though (on Linux). I'm also not sure whether it is correct if the function is inlined or if tail-call optimization (where jmp is used instead of call) is applied. Do you see any problems or side effects with this approach?

@aix: I am trying to achieve exactly what I've stated, which is to not change the original stack. Basically I have two identical processes, whose memories (including stack) I need to compare periodically. For some functions, these two processes will take different execution paths, so we would not be able to properly compare their memories because they would differ. So for such functions, I need to switch the stack so that the original stack is the same for the two processes if no other divergence occurs. — MetallicPriest, Jan 11 '12 at 12:02
Warning: GCC 4.7 will contain a new "shrink wrap" optimization that might break your assumptions; it delays parts of the function prologue until they're actually needed, so the program runs more efficiently if a function has an early exit. Ubuntu GCC already has this feature, although it's disabled by default (due to stability issues). — ams, Jan 11 '12 at 12:03
@MetallicPriest: Thanks for taking the time to explain the context. — NPE, Jan 11 '12 at 12:14
Have you considered using [longjmp](http://pubs.opengroup.org/onlinepubs/7908799/xsh/longjmp.html) or [swapcontext](http://pubs.opengroup.org/onlinepubs/009695399/functions/makecontext.html)? — Piotr Praszmo, Jan 11 '12 at 12:16
@Banthar: Yes I was using swapcontext before, but it was too slow. Moreover it has restrictions, such as, you can only use int as arguments and you can't return a value. — MetallicPriest, Jan 11 '12 at 12:20
Ok. You do not modify the stack, you save it, then you restore it. Then the function returns. The stack you went to the trouble of saving is now history. And yes, an early return via compiler optimization will break your model. Are you considering what happens with inlining which compilers can do without asking? — jim mcnamara, Jan 11 '12 at 15:05
@MetallicPriest: When checking the stack, don't check the last stack frame. Your approach won't work in the case where some_func calls other functions where you want the stack to be checked. You also have the issue that the temporary stack is still within "memory" and it'll get checked anyway. I have a question: which process is doing the checking? How do you verify that each process is at the same point in the code? — fdk1342, Jan 11 '12 at 17:09
@MetallicPriest: Please give your question a more specific title. — Adrian McCarthy, Jan 11 '12 at 17:31

score 5 · Accepted Answer · answered Jan 11 '12 at 17:25

With the code that you've shown above, I can think of the following breakage:

On x86/x64, GCC will decorate your function with a prologue/epilogue if it sees fit, and you can't stop it from doing that (as you can on ARM, where __attribute__((__naked__)) forces code generation without prologues/epilogues, i.e. without stack frame setup).
That might end up allocating stack / creating references to stack memory locations before you switch the stack. Even worse, if the compiler chooses to put such an address into a nonvolatile register before you switch the stack, the same variable can resolve to two different locations: the stack-pointer-relative one, which now lands in the new stack, and the register-relative one, which still points into the old stack.
Again, on x64, the System V ABI defines an optimization for leaf functions (the "red zone"): no stack frame is allocated, yet the 128 bytes of stack "below" the stack pointer are usable by the function. Unless your memory buffer takes this into account, overruns that you're not expecting might occur.
Signals can be handled on alternate stacks (see sigaltstack()); otherwise the handler's frame is pushed onto whatever stack is current when the signal arrives. Doing your own stack switching might make your code uncallable from within signal handlers. It'll definitely make it non-reentrant, and depending on where/how you retrieve the "stack location", it can also make it non-thread-safe.
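To make the signal point concrete, here is a minimal sketch (assuming Linux with glibc) that installs an alternate signal stack and checks that a handler registered with SA_ONSTACK really runs on it. The names handler_runs_on_alt_stack, on_sigusr1 and ALT_STACK_BYTES are made up for this illustration; only sigaltstack(), sigaction() and raise() are real APIs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALT_STACK_BYTES (64 * 1024)  /* hypothetical size chosen for the sketch */

/* Address of a local variable inside the handler, recorded for inspection. */
static volatile uintptr_t handler_local_addr;

static void on_sigusr1(int signo)
{
    volatile int marker = signo;  /* lives on whatever stack the handler runs on */
    handler_local_addr = (uintptr_t)&marker;
}

/* Returns 1 if the handler ran on the alternate stack, 0 if not, -1 on error. */
int handler_runs_on_alt_stack(void)
{
    char *mem = malloc(ALT_STACK_BYTES);
    if (mem == NULL)
        return -1;

    stack_t ss, old_ss;
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = mem;
    ss.ss_size = ALT_STACK_BYTES;
    ss.ss_flags = 0;

    struct sigaction sa, old_sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sa.sa_flags = SA_ONSTACK;     /* ask the kernel to use the alternate stack */
    sigemptyset(&sa.sa_mask);

    int result = -1;
    if (sigaltstack(&ss, &old_ss) == 0) {
        if (sigaction(SIGUSR1, &sa, &old_sa) == 0) {
            handler_local_addr = 0;
            if (raise(SIGUSR1) == 0) {
                uintptr_t lo = (uintptr_t)mem;
                uintptr_t hi = lo + ALT_STACK_BYTES;
                result = (handler_local_addr >= lo && handler_local_addr < hi);
            }
            sigaction(SIGUSR1, &old_sa, NULL);  /* restore previous disposition */
        }
        sigaltstack(&old_ss, NULL);             /* restore previous alt-stack setting */
    }
    free(mem);
    return result;
}
```

Without SA_ONSTACK the handler frame would be pushed onto whichever stack RSP points at when the signal is delivered, which in the question's scheme could be either the original stack or the makeshift one.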

In general, if you want to run a specific piece of code on a different stack, why not either:

run it in a different thread (every thread gets a different stack)?
trigger e.g. SIGUSR1 and run your code in a signal handler (which you can configure to use a different stack)?
run it via makecontext() / swapcontext() (see the example in the manpage)?
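As an illustration of the last option, the following is a rough sketch (assuming glibc on Linux) of running a function on a separately allocated stack via makecontext() / swapcontext(). The names run_on_private_stack, worker and ALT_STACK_SIZE are hypothetical, and the result is passed back through static variables, which sidesteps the "no return value" restriction but is not thread-safe as written.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

#define ALT_STACK_SIZE (64 * 1024)  /* hypothetical size chosen for the sketch */

static ucontext_t caller_ctx, worker_ctx;
static int worker_result;            /* result handed back to the caller */
static uintptr_t worker_local_addr;  /* address of a local, to show where it lives */

/* Runs entirely on the private stack; returning resumes uc_link (the caller). */
static void worker(void)
{
    volatile int local = 21;
    worker_local_addr = (uintptr_t)&local;
    worker_result = local * 2;
}

/* Returns worker's result if it ran on the private stack, -1 otherwise. */
int run_on_private_stack(void)
{
    char *stack = malloc(ALT_STACK_SIZE);
    if (stack == NULL)
        return -1;

    if (getcontext(&worker_ctx) == -1) {
        free(stack);
        return -1;
    }
    worker_ctx.uc_stack.ss_sp = stack;
    worker_ctx.uc_stack.ss_size = ALT_STACK_SIZE;
    worker_ctx.uc_link = &caller_ctx;  /* where to continue when worker returns */
    makecontext(&worker_ctx, worker, 0);

    /* Save the current context and jump to worker on its own stack. */
    if (swapcontext(&caller_ctx, &worker_ctx) == -1) {
        free(stack);
        return -1;
    }

    uintptr_t lo = (uintptr_t)stack;
    int on_private = worker_local_addr >= lo &&
                     worker_local_addr < lo + ALT_STACK_SIZE;
    free(stack);
    return on_private ? worker_result : -1;
}
```

Unlike the inline-asm macros, the compiler sees worker() as an ordinary function with its own prologue and epilogue, so none of the frame-layout assumptions discussed above are needed. The cost is the extra context save/restore on every switch.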

Edit:

Since you say you "want to compare the memory of two processes": again, there are different methods for that, in particular external process tracing. Attach a "debugger" (this can be a process you write yourself that uses ptrace() to control what you want to monitor) and have it handle e.g. breakpoints / checkpoints on behalf of the processes you trace, performing the validations you need. That would also be more flexible, because it doesn't require changing the code you inspect.

score 1 · Answer 2 · answered Jan 05 '22 at 09:30

-fomit-frame-pointer is on by default when optimization is enabled, so locals are then addressed relative to RSP rather than RBP. Unless you plan to compile with optimization disabled, the assumption that functions don't touch RSP except in the prologue/epilogue is super broken.

Even if you did use -O3 -fno-omit-frame-pointer, compilers will still move RSP around in some cases, although they won't use it to access args and locals. For example, alloca / a C99 VLA, or even calling a function that has more than 6 args (or more precisely, one with args that don't fit in registers), will all move RSP. (Calling a function might just use mov stores, depending on the strategy chosen by the compiler.)
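The VLA case can be observed directly. The sketch below (GCC/Clang on x86-64 assumed; read_sp and sp_shift_from_vla are hypothetical names) samples RSP before and after a variable-length array comes into scope and returns the difference in bytes. On other architectures read_sp() just returns 0, so the function reports no shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads the current stack pointer on x86-64; returns 0 elsewhere. */
static inline uintptr_t read_sp(void)
{
#if defined(__x86_64__)
    uintptr_t sp;
    __asm__ __volatile__("mov %%rsp, %0" : "=r"(sp));
    return sp;
#else
    return 0;
#endif
}

/* Returns how far RSP moved down (in bytes) across the VLA allocation.
 * n must be at least 1. noinline keeps n a runtime value, so the
 * compiler cannot turn the VLA into a fixed-size array. */
__attribute__((noinline))
long sp_shift_from_vla(size_t n)
{
    uintptr_t before = read_sp();
    volatile char buf[n];        /* C99 VLA: allocated by adjusting RSP mid-function */
    buf[0] = 1;
    buf[n - 1] = 2;              /* touch both ends so the array is really used */
    uintptr_t after = read_sp();
    return (long)(before - after);
}
```

With typical GCC or Clang code generation, sp_shift_from_vla(256) reports a shift of at least 256 bytes, but the compiler is free to place the adjustment elsewhere, so treat the exact value as unspecified. The point is only that RSP changes in the middle of the function body, well outside the prologue and epilogue.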

Also, the "shrink wrap" optimization, where a function delays saving call-preserved regs until after a possible early-out, could mean your stack switch happens before the compiler is ready to save/restore. And your restore might happen too late or too early. (This was mentioned in comments by ams.)
