How do I correctly fork a child process in Twisted when the child does not use anything from Twisted but does use data from the parent process? For example, to process a “snapshot” of some data from the parent process and write it to a file, without blocking.

It seems that if I do anything like a clean shutdown in the child process after os.fork(), it closes some of the sockets / descriptors in the parent process. The only way I see to avoid that is os.kill(os.getpid(), signal.SIGKILL), which seems like a bad idea (though not directly problematic).

(Additionally, if a dict is changed in the parent process, can it change in the child process too? A quick test shows that it doesn't. The OS/kernels are Debian stable / sid.)

HoverHell
  • (I somewhat suspect one of the `os.exec*` with something like a do-nothing shell would solve that) – HoverHell Nov 01 '12 at 16:45
  • Can you elaborate on what you mean by "uses data from the parent process"? The specifics of the data you want to work with will probably make a big difference to the best answer to this question. – Jean-Paul Calderone Nov 01 '12 at 22:59

1 Answer

IReactorProcess.spawnProcess (usually available as from twisted.internet import reactor; reactor.spawnProcess) can spawn a process running any available executable on your system. The subprocess does not need to use Twisted, or, indeed, even be in Python.
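
As a rough sketch of what that could look like for the "write a snapshot to a file" case: the parent serializes the data once (that is the snapshot), then hands the bytes to a child over stdin. This assumes the data is JSON-serializable, which the question does not say. `CHILD_SOURCE`, `child_argv`, `spawn_snapshot_writer` and `FeedStdin` are made-up names for illustration, and the child here is just a one-off Python script.

```python
import json
import sys

# Hypothetical child program: copy stdin to the file named in argv[1].
CHILD_SOURCE = (
    "import sys\n"
    "with open(sys.argv[1], 'wb') as f:\n"
    "    f.write(sys.stdin.buffer.read())\n"
)


def child_argv(out_path):
    """Full argv for the child; argv[0] is the interpreter itself."""
    return [sys.executable, "-c", CHILD_SOURCE, out_path]


def spawn_snapshot_writer(snapshot, out_path):
    """Serialize `snapshot` now, then let a child process write it out.

    Returns a Deferred that fires with the child's exit code
    (None if the child was killed by a signal).
    """
    # Imported lazily: importing twisted.internet.reactor installs the
    # default reactor as a side effect.
    from twisted.internet import defer, protocol, reactor

    # Assumption: the data is JSON-serializable. This call is synchronous,
    # so later changes in the parent do not affect what the child receives.
    payload = json.dumps(snapshot).encode("utf-8")
    finished = defer.Deferred()

    class FeedStdin(protocol.ProcessProtocol):
        def connectionMade(self):
            # The child has started: send the data, then signal end of input.
            self.transport.write(payload)
            self.transport.closeStdin()

        def processEnded(self, reason):
            # reason.value is a ProcessDone or ProcessTerminated instance.
            finished.callback(reason.value.exitCode)

    # env=None lets the child inherit the parent's environment on POSIX.
    reactor.spawnProcess(FeedStdin(), sys.executable, child_argv(out_path), env=None)
    return finished
```

The serialization itself still runs in the parent, so this only moves the file I/O out of the reactor thread. Whether that trade-off is acceptable depends on the data.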

Do not call os.fork yourself. As you've discovered, it has lots of very peculiar interactions with process state that spawnProcess will manage for you.

Among the problems with os.fork are:

  • Forking copies your current process state, but doesn't copy the state of threads. This means that any thread in the middle of modifying some global state will leave things half-broken, possibly holding some locks which will never be released. Don't run any threads in your application? Have you audited every library you use, every one of its dependencies, to ensure that none of them have ever or will ever use a background thread for anything?
  • You might think you're only touching certain areas of your application memory, but thanks to Python's reference counting, any object which you even peripherally look at (or is present on the stack) may have reference counts being incremented or decremented. Incrementing or decrementing a refcount is a write operation, which means that whole page (not just that one object) gets copied back into your process. So forked processes in Python tend to accumulate a much larger copied set than, say, forked C programs.
  • Many libraries, famously all of the libraries that make up the systems on macOS and iOS, cannot handle fork() correctly and will simply crash your program if you attempt to use them after fork but before exec.
  • There's a flag for telling file descriptors to close on exec - but no such flag to have them close on fork. So any files (including log files, and again, any background temp files opened by libraries you might not even be aware of) can get silently corrupted or truncated if you don't manage access to them carefully.
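
To make the last point concrete, here is a small POSIX-only sketch (an illustration of the problem, not a recommendation to fork). It shows that a descriptor Python has marked non-inheritable, meaning it is closed on exec, is still open in a forked child. The helper names `fd_is_open` and `child_sees_fd_after_fork` are made up for this example.

```python
import os


def fd_is_open(fd):
    """Return True if `fd` refers to an open descriptor in this process."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def child_sees_fd_after_fork(fd):
    """Fork, and report whether the child still has `fd` open."""
    pid = os.fork()
    if pid == 0:
        # Child: report through the exit status and leave via os._exit(),
        # which skips atexit handlers and does not flush stdio buffers
        # copied from the parent.
        os._exit(0 if fd_is_open(fd) else 1)
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


read_fd, write_fd = os.pipe()
# Since Python 3.4, descriptors created by Python are non-inheritable,
# so they are closed across exec...
print("inheritable across exec:", os.get_inheritable(read_fd))
# ...but fork copies the descriptor table regardless of that flag.
print("open in forked child:", child_sees_fd_after_fork(read_fd))
os.close(read_fd)
os.close(write_fd)
```

The same applies to sockets and log files the parent holds: after a bare fork, the child shares all of them unless it closes them itself.
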
Glyph
  • As I've noted, I need access to the data in parent process (to the snapshot of the data, even. Which seems to be exactly what fork() provides). Which means, spawning a process from an executable is not useful in here. – HoverHell Nov 02 '12 at 00:09
  • But access to what data? As it turns out, `os.fork` in any Python program is almost certainly broken (by design). The only exception might be if you instantly call `os.execve` (or one of the similar functions) - which throws away all of your data, just like `spawnProcess` does. So it would help if you changed your question from "How can I use `os.fork`?" to "How can I share data in the form of X with another process?" or perhaps even just go all the way to "How can I persist data in the form of X in a Twisted-using program?" – Jean-Paul Calderone Nov 02 '12 at 01:39
  • I don't have any problems making it persist in general. I'm interested in trying to make it atomic-like in more ways. And no, it doesn't seem to me at all that os.fork is broken. It works quite well in many cases (and also all over the `multiprocessing`) and can be used to process, for example, large numpy arrays with python code on multiple cores without duplicating them in memory. And it can be used the OP way pretty well, but I'm trying to figure out how to do that *better*. – HoverHell Nov 02 '12 at 11:21
  • @HoverHell, it is in fact broken in a number of surprising ways. If what you want to do is share a large numpy array between processes, consider using `numpy.memmap` http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html#numpy.memmap – Glyph Nov 07 '12 at 00:49
  • @Glyph, any references on how it is broken? – HoverHell Nov 12 '12 at 13:13
  • This is a non-answer. OP wants to reasonably leverage the operating system API to very efficiently and simply create a memory snapshot. How is Python's fork broken? – Jan Matějka Jun 18 '19 at 11:35
  • @yaccz An *exhaustive* explanation of all the ways `fork()` could go wrong is probably a semester-long 300-level college course on operating systems, but I've added a brief summary of some of the problems to the answer so that this isn't just a blind argument from authority. – Glyph Jun 21 '19 at 16:32