Transparently replace file mapping with anonymous

Question

I am doing checkpoint-and-restore using CRIU. After restore, my application wakes with some threads whose stacks are mmapped to files on disk (CRIU doesn't do this by default; it is a custom optimization). Later on, I want to transparently replace each such mapping with anonymous memory: allocate a new region, copy the contents over, and finally call mremap to move it to the original address.
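The allocate/copy/mremap sequence described above might look roughly like the following. This is a minimal sketch, not the code from the project: `replace_with_anonymous` and `map_demo_file` are hypothetical names, the temporary file only stands in for the real stack file, and nothing here stops another thread from writing to the range during the `memcpy`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Replace whatever is mapped at [addr, addr + len) with an anonymous copy of
 * its current contents. Returns 0 on success, -1 on failure.
 * Assumption: no other thread writes to the range while this runs. */
int replace_with_anonymous(void *addr, size_t len)
{
    /* mremap() requires a page-aligned target address. */
    assert((uintptr_t)addr % (uintptr_t)sysconf(_SC_PAGESIZE) == 0);

    void *tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tmp == MAP_FAILED)
        return -1;

    memcpy(tmp, addr, len);

    /* MREMAP_FIXED unmaps the old mapping at addr and moves tmp there
     * in a single call. */
    if (mremap(tmp, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, addr) == MAP_FAILED) {
        munmap(tmp, len);
        return -1;
    }
    return 0;
}

/* Hypothetical demo helper: create a private mapping of a fresh temp file. */
void *map_demo_file(size_t len, int *fd_out)
{
    char path[] = "/tmp/stack-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return p;
}
```

After a successful call the address range keeps its contents but is no longer backed by the file, so the file descriptor can be closed and the file removed.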

However, there's a glitch in this approach: if the threads start mutating the stack while I copy it over, I could break the application. Ideally, I would trap such writes using userfaultfd, but it's not possible to register a file-mapped memory region. Even if I introduced some mutex to those threads, there's no way to tell that a thread is really parked and won't mutate its stack until I wake it up.

I am thinking of using mprotect to make the stack read-only and handling SIGSEGV. Or is there a better approach? ptrace on the process itself?
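For reference, the mprotect-plus-SIGSEGV idea could be sketched as below. This is only an illustration under several assumptions: `guard_region` and the `guard_*` globals are hypothetical names, the guarded region is an ordinary mapping rather than a live thread stack, and the handler simply restores write access instead of waiting for a copy to finish. It relies on Linux re-executing the faulting instruction after the handler returns, which the C and POSIX standards do not promise. Guarding a thread's own stack would additionally need an alternate signal stack (`sigaltstack` with `SA_ONSTACK`), since signal delivery itself writes to the stack.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *guard_base;                    /* start of the guarded region */
static size_t guard_len;                    /* its length in bytes */
static volatile sig_atomic_t guard_hits;    /* number of trapped writes */

static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    char *fault = (char *)si->si_addr;
    if (fault < guard_base || fault >= guard_base + guard_len)
        _exit(139);                         /* unrelated crash: give up */
    guard_hits++;
    /* A real implementation would block here until the copy is complete.
     * mprotect() is not on the POSIX async-signal-safe list; on Linux it
     * is a plain system call, which this sketch relies on. */
    if (mprotect(guard_base, guard_len, PROT_READ | PROT_WRITE) != 0)
        _exit(140);
}

/* Make [base, base + len) read-only and trap the first write to it.
 * Returns 0 on success, -1 on failure. */
int guard_region(void *base, size_t len)
{
    assert(base != NULL && len > 0);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    guard_base = base;
    guard_len = len;
    guard_hits = 0;
    return mprotect(base, len, PROT_READ);
}
```

Reads of the guarded region proceed without faulting; only the first write is trapped, after which the region is writable again.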

Are you saying that it is only after the restore, as a function of custom modifications you've made to CRIU, that these threads have their stacks mmapped to a file? — John Bollinger, Apr 05 '23 at 20:00
Program behavior is undefined in the event that a signal handler is called for a `SIGSEGV`, and that call returns normally. A handler for `SIGSEGV` needs to terminate the program -- for example, by calling `_exit()` or `abort()`. — John Bollinger, Apr 05 '23 at 20:06
Yes, normally the thread stack wouldn't be allocated on file-mapped memory. Program state is really undefined by POSIX (as man sigaction says), but handling SIGSEGV is used in JVMs on all major platforms so in practice it's workable. IIUC it continues on the same instruction again when I manage to fix the memory. — Radim Vansa, Apr 06 '23 at 08:56
Not just POSIX. The C language spec says that program behavior is undefined if a signal handler is called to handle a `SIGSEGV`, and it returns normally. I can imagine that some applications might use such an approach anyway in carefully controlled, carefully tested ways, in conjunction with specific C implementations. That does not generalize. — John Bollinger, Apr 06 '23 at 13:28

score 1 · Accepted Answer · answered Apr 10 '23 at 17:13

The only alternative I have come up with that I would trust is for the main thread to use ptrace to force the others to stop, and then to resume them when that is safe. You seem to already be aware of this option, so I will not go into details. The main objective here is to preemptively suspend the activity of the affected threads while their stacks are being copied, which seems far less risky than approaches that do otherwise.

The alternative presented in the question is to use mprotect to trap the threads' attempts to modify data on their stacks while the copy is being made. I guess the idea is to have a lighter touch, allowing threads to proceed as long as they can do so without modifying their stacks, but I don't think that's plausible or viable. Among other things:

- It seems unlikely in general that any thread will be able to do much meaningful work without modifying its stack, so it seems doubtful that there is much gain available in practice.
- As I observed in comments, both C and POSIX specify that a program has undefined behavior if a signal handler for SIGSEGV returns normally. Usually, program termination is the only viable alternative, but a sufficiently prepared program might in some cases longjmp() or siglongjmp() out of the handler instead. That could give you a vector for recovery, but only to whatever extent you are prepared to mediate it with special tooling, and only to the extent supported by such tooling.
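To illustrate the siglongjmp() escape mentioned above, here is a minimal sketch in which the handler never returns normally. The names `try_write` and `demo_readonly_page` are hypothetical, and the sketch covers a single-threaded probe only; it abandons the faulting write rather than retrying it, so it shows the recovery vector, not a way to resume a thread transparently.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;
static volatile sig_atomic_t armed;   /* 1 only while try_write() is probing */

static void segv_escape(int sig)
{
    (void)sig;
    if (!armed)
        _exit(139);                   /* not our probe: terminate */
    siglongjmp(recover_point, 1);     /* leave the handler without returning */
}

/* Attempt *addr = value. Returns 0 if the write succeeded, 1 if it faulted
 * and was abandoned, -1 if the handler could not be installed. */
int try_write(volatile char *addr, char value)
{
    assert(addr != NULL);

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_escape;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    int faulted;
    /* savemask = 1 so that SIGSEGV is unblocked again after the jump. */
    if (sigsetjmp(recover_point, 1) == 0) {
        armed = 1;
        *addr = value;
        faulted = 0;
    } else {
        faulted = 1;
    }
    armed = 0;
    sigaction(SIGSEGV, &old, NULL);
    return faulted;
}

/* Hypothetical demo helper: one anonymous page mapped read-only. */
volatile char *demo_readonly_page(void)
{
    void *p = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (volatile char *)p;
}
```

The caller has to decide what to do after a reported fault, which is the "special tooling" part: nothing in this sketch re-runs the interrupted code.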

It is not safe to assume that the trap handler installed by the kernel will have the effect of retrying the failed instruction of your userspace program in the event that a handler for a segfault returns normally. That ranks very high among the implications of the userspace behavior being undefined. If you did observe that effect with a particular combination of hardware and software then that would be no basis for relying on the same thing for different combinations.

Thanks for the elaborate answer; right now I have a working (of course, only to the tested extent) solution with SIGSEGV, and I'll see the feedback in the context of that project. I agree that ptrace is probably a more sound solution, but it requires more privileges, and therefore probably another binary with suid. My goal is not really to let other threads continue, but rather to have a simple way to block them. — Radim Vansa, Apr 11 '23 at 19:16

score 1 · Answer 2 · answered Apr 29 '23 at 17:23

That premise seems a bit weird to me, I don't really get why you'd have the stacks file-mapped after such a CRIU operation... but anyway:

First off: There is one type of file mapping that userfaultfd does work with, which is shmem/tmpfs. But I don't know whether that helps in your case. If not:

You can't register the file mapping with userfaultfd, but you can register the new anonymous mapping with userfaultfd. This means that one thing you could do would be to first replace the stack with the new mapping, then copy the data over from the file when you know the old mapping is no longer used.
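A rough sketch of that idea follows: an anonymous mapping is registered with userfaultfd and a helper thread fills each page on first touch. Everything named here (`struct lazy_region`, `lazy_region_start`, `lazy_region_stop`, `serve_faults`) is hypothetical, the `src` buffer merely stands in for the file contents, and the sketch creates a fresh mapping instead of placing it over an existing stack. It assumes a Linux kernel with userfaultfd available to the process; `lazy_region_start` returns -1 where it is not (for example under a restrictive seccomp profile).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1          /* flag added in Linux 5.11 */
#endif

struct lazy_region {                   /* hypothetical bookkeeping */
    int uffd;
    char *base;                        /* anonymous mapping being filled */
    const char *src;                   /* stand-in for the file contents */
    size_t len, page;
    atomic_int stop;
    pthread_t thread;
};

/* Helper thread: on each missing-page fault, copy that page in from src. */
static void *serve_faults(void *arg)
{
    struct lazy_region *r = arg;
    while (!atomic_load(&r->stop)) {
        struct pollfd pfd = { .fd = r->uffd, .events = POLLIN };
        struct uffd_msg msg;
        if (poll(&pfd, 1, 50) <= 0)
            continue;
        if (read(r->uffd, &msg, sizeof msg) != (ssize_t)sizeof msg)
            continue;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;
        unsigned long addr = msg.arg.pagefault.address & ~((unsigned long)r->page - 1);
        struct uffdio_copy copy = {
            .dst = addr,
            .src = (unsigned long)(r->src + (addr - (unsigned long)r->base)),
            .len = r->page,
        };
        ioctl(r->uffd, UFFDIO_COPY, &copy);   /* also wakes the faulting thread */
    }
    return NULL;
}

/* Returns 0 on success, -1 if userfaultfd is unavailable or setup fails. */
int lazy_region_start(struct lazy_region *r, const char *src, size_t len)
{
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg = { .mode = UFFDIO_REGISTER_MODE_MISSING };
    int flags = O_CLOEXEC | O_NONBLOCK;

    r->page = (size_t)sysconf(_SC_PAGESIZE);
    assert(len > 0 && len % r->page == 0);
    r->src = src;
    r->len = len;
    r->base = MAP_FAILED;
    atomic_init(&r->stop, 0);

    r->uffd = (int)syscall(SYS_userfaultfd, flags | UFFD_USER_MODE_ONLY);
    if (r->uffd < 0 && errno == EINVAL)       /* kernel older than 5.11 */
        r->uffd = (int)syscall(SYS_userfaultfd, flags);
    if (r->uffd < 0)
        return -1;
    if (ioctl(r->uffd, UFFDIO_API, &api) != 0)
        goto fail;
    r->base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (r->base == MAP_FAILED)
        goto fail;
    reg.range.start = (unsigned long)r->base;
    reg.range.len = len;
    if (ioctl(r->uffd, UFFDIO_REGISTER, &reg) != 0)
        goto fail;
    if (pthread_create(&r->thread, NULL, serve_faults, r) != 0)
        goto fail;
    return 0;
fail:
    if (r->base != MAP_FAILED)
        munmap(r->base, len);
    close(r->uffd);
    return -1;
}

/* Stop the helper thread; the mapping itself stays in place. */
void lazy_region_stop(struct lazy_region *r)
{
    atomic_store(&r->stop, 1);
    pthread_join(r->thread, NULL);
    close(r->uffd);
}
```

A thread touching an unpopulated page simply blocks until the helper has copied that page in, which gives the blocking behaviour the question asks for without a signal handler.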

You probably don't want to do exactly this, because then you'd have to block for as long as it takes to copy the entire stack. There are two optimizations you could consider:

- You could try to stop the thread and figure out the thread's current stack pointer; any memory that is sufficiently far below the stack pointer based on the ABI (e.g. 128 bytes on amd64) doesn't need to be copied at all, you only have to register the currently used part of the stack with userfaultfd. (Probably a good way to do this would be to send a signal to the thread and let the signal handler take care of this.) If your threads typically have relatively little stack usage and only use lots of stack memory for short moments, this is probably all you need?
- You could copy the file contents into anonymous memory area A ahead of time while letting the kernel monitor which of the file mapping pages have been written to. Then after you replace the file mapping with a new anonymous mapping B with userfaultfd, you can ask the kernel which parts of the file mapping have been written to, copy all those parts into mapping A again, and then mremap() mapping A over the file mapping. This probably only makes sense if your stacks are typically pretty big. To figure out which parts of a file mapping have changed, you can use the kernel's Soft-Dirty interface, using bit 55 in /proc/[pid]/pagemap and /proc/[pid]/clear_refs.
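The Soft-Dirty query mentioned above can be sketched as follows. The function names `soft_dirty_clear` and `soft_dirty_test` are hypothetical; the sketch assumes a kernel built with soft-dirty tracking (otherwise bit 55 never gets set) and works on the calling process via /proc/self. Writing "4" to clear_refs resets the soft-dirty bits for the whole process, not for a single mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Reset the soft-dirty bits of every page in this process.
 * Returns 0 on success, -1 on failure. */
int soft_dirty_clear(void)
{
    int fd = open("/proc/self/clear_refs", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, "4", 1);
    close(fd);
    return n == 1 ? 0 : -1;
}

/* Returns 1 if the page containing addr has been written to since the last
 * soft_dirty_clear(), 0 if not, -1 on error. */
int soft_dirty_test(const void *addr)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* pagemap holds one 64-bit entry per virtual page. */
    uint64_t entry = 0;
    off_t off = (off_t)(((uintptr_t)addr / (uintptr_t)page) * sizeof entry);
    ssize_t n = pread(fd, &entry, sizeof entry, off);
    close(fd);
    if (n != (ssize_t)sizeof entry)
        return -1;

    return (int)((entry >> 55) & 1);   /* bit 55: soft-dirty */
}
```

In the scheme described above, one would call `soft_dirty_clear()` before the ahead-of-time copy and then test each page of the old mapping to decide which pages need to be copied again.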

The mmapping is an optimization in a forked version of CRIU - regular CRIU just loads the files, but here we make the loading lazy. Probably the best option is really to not load it in the first place, and use userfaultfd to implement the laziness. — Radim Vansa, May 03 '23 at 08:05
