
I am curious to see if it would be possible to implement posix_spawn in Linux using a combination of vfork+exec. In a very simplified way (leaving out most optional arguments) this could look more or less like this:

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

int my_posix_spawn(pid_t *ppid, char **argv, char **env)
{
    pid_t pid;

    pid = vfork();
    if (pid == -1)
        return errno;

    if (pid == 0)
    {
        /* Child */
        execve(argv[0], argv, env);

        /* If we got here, execve failed. How to communicate this to
         * the parent? For now, exit with 127, the conventional
         * "could not execute" status. */
        _exit(127);
    }

    /* Parent */
    if (ppid != NULL)
        *ppid = pid;

    return 0;
}

However, I am wondering how to cope with the case where vfork succeeds (so the child process is created) but the exec call fails. There seems to be no way to communicate this to the parent, which would only see that it apparently created a child process successfully (as it would get a valid pid back).

Any ideas?

Grodriguez

2 Answers


As others have noted in the comments, posix_spawn is permitted to create a child process that immediately dies due to exec failure or other post-fork failures; the calling application needs to be prepared for this. But of course it's preferable not to do so.

The general procedure for communicating exec failure to the parent is described in an answer I wrote on this question: What can cause exec to fail? What happens next?.
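
For readers who do not want to follow the link, here is a minimal sketch of the usual close-on-exec pipe approach, which may or may not match the linked answer in every detail. It uses plain fork rather than vfork, so it does not avoid duplicating the VM. The name spawn_report_exec_error is hypothetical and only there for illustration. The idea is that a successful exec closes the write end of the pipe automatically, so the parent reads end-of-file; if exec fails, the child writes its errno value into the pipe before exiting.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of any library): returns 0 and stores the
 * child's pid on success, otherwise returns an errno value. */
int spawn_report_exec_error(pid_t *ppid, char **argv, char **env)
{
    int fds[2];
    int err = 0;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) == -1)
        return errno;

    /* Mark both ends close-on-exec so a successful exec closes them. */
    if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) == -1 ||
        fcntl(fds[1], F_SETFD, FD_CLOEXEC) == -1) {
        err = errno;
        close(fds[0]);
        close(fds[1]);
        return err;
    }

    pid = fork();
    if (pid == -1) {
        err = errno;
        close(fds[0]);
        close(fds[1]);
        return err;
    }

    if (pid == 0) {
        /* Child: only reaches the write() if execve fails. */
        close(fds[0]);
        execve(argv[0], argv, env);
        err = errno;
        if (write(fds[1], &err, sizeof err) != (ssize_t)sizeof err) {
            /* Nothing more can be done; the parent will see EOF. */
        }
        _exit(127);
    }

    /* Parent: EOF means exec succeeded, data means it failed. */
    close(fds[1]);
    do {
        n = read(fds[0], &err, sizeof err);
    } while (n == -1 && errno == EINTR);
    close(fds[0]);

    if (n == (ssize_t)sizeof err) {
        /* Exec failed: reap the child so it does not linger as a zombie. */
        while (waitpid(pid, NULL, 0) == -1 && errno == EINTR)
            ;
        return err;
    }

    if (ppid != NULL)
        *ppid = pid;
    return 0;
}
```

With this sketch, a call with a nonexistent argv[0] should return ENOENT instead of a valid pid. Note that setting FD_CLOEXEC with a separate fcntl call is racy in multithreaded programs; pipe2 with O_CLOEXEC avoids that where it is available.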

Unfortunately, however, some of the operations you need to perform are not legal after vfork due to its nasty returns-twice semantics. I've covered this topic in the past in an article on ewontfix.com. The solution for making a posix_spawn that avoids duplicating the VM seems to be using clone with CLONE_VM (and possibly CLONE_VFORK) to get a new process that shares memory but doesn't run on the same stack. However, this still requires a lot of care to avoid making any calls to libc functions that might modify memory used by the parent. My current implementation is here:

http://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c?id=v1.1.4

and as you can see it's rather complicated. Reading the git history may be informative regarding some of the design decisions that were made.
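
To make the clone idea a bit more concrete, here is a heavily simplified sketch. It is not the musl code linked above, and it leaves out almost everything that makes the real thing hard: it does not block signals around the clone, does not handle thread cancellation, and performs no file actions or attribute changes. The names spawn_with_clone, spawn_child, struct spawn_args and CHILD_STACK_SIZE are hypothetical. It assumes the glibc clone() wrapper and relies on CLONE_VFORK to keep the parent suspended until the child has exec'd or exited, so the child can report an exec failure simply by storing errno in memory shared through CLONE_VM.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024) /* assumed to be enough for execve */

struct spawn_args {
    char **argv;
    char **env;
    int err; /* written by the child if execve fails */
};

/* Runs in the child, on its own stack but in the parent's address space. */
static int spawn_child(void *p)
{
    struct spawn_args *args = p;

    execve(args->argv[0], args->argv, args->env);
    args->err = errno; /* visible to the parent because of CLONE_VM */
    _exit(127);
}

/* Hypothetical helper: returns 0 and stores the pid on success,
 * otherwise returns an errno value. */
int spawn_with_clone(pid_t *ppid, char **argv, char **env)
{
    struct spawn_args args = { argv, env, 0 };
    int saved_errno = errno; /* the child shares our errno; restore later */
    int clone_errno;
    char *stack;
    pid_t pid;

    stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return ENOMEM;

    /* The stack grows down on common Linux targets, so pass its top. */
    pid = clone(spawn_child, stack + CHILD_STACK_SIZE,
                CLONE_VM | CLONE_VFORK | SIGCHLD, &args);
    clone_errno = errno;

    /* With CLONE_VFORK we only get here once the child has exec'd or
     * exited, so it no longer uses the stack. */
    free(stack);

    if (pid == -1) {
        errno = saved_errno;
        return clone_errno;
    }

    if (args.err != 0) {
        /* Exec failed: reap the child and report its error. */
        while (waitpid(pid, NULL, 0) == -1 && errno == EINTR)
            ;
        errno = saved_errno;
        return args.err;
    }

    errno = saved_errno;
    if (ppid != NULL)
        *ppid = pid;
    return 0;
}
```

Because the child shares the parent's memory, anything it runs between clone and execve (including a signal handler that happens to fire) can corrupt the parent's state, which is why a real implementation needs much more care than this.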

R.. GitHub STOP HELPING ICE

I don't think there's any good way to do this with the current set of system calls. You've correctly identified the biggest problem -- the absence of any reliable way to report failure after the vfork. Other problems include race conditions in setting child state, and Linux's lack of interest in picking up closefrom.

Several years ago I sketched a new system-level API that would solve this problem: the key addition is a system call, which I called egg(), that creates a process with no address space and no state inherited from the parent. Obviously, an egg process can't execute code; but you can (with a whole bunch more new system calls) set all of its kernel-side state, and then (with yet another system call, hatch()) load an executable into it and set it going. Crucially, all of the new system calls report failure in the parent. For instance, there's a dup_into(pid, to_fd, from_fd) call that copies parent file descriptor from_fd to egg-state process pid's file descriptor to_fd; if it fails, the parent gets the failure code.

I never had time to flesh all of that out into a coherent API specification and code it up (and I'm not a kernel hacker, anyway) but I still think the concept has legs and I would be happy to work with someone to get it done.

zwol
  • Is there any reason this API couldn't be implemented in userspace as I described, marshalling the `dup_into` etc. calls into the `CLONE_VM` process with some sort of IPC? – R.. GitHub STOP HELPING ICE Aug 14 '14 at 17:16
  • @R.. Off the top of my head, the only thing I can think of that might make that approach *impossible* is that `clone` AFAICT offers no way to say "start this child with *no* open file descriptors". However, it would be subject to all the problems you have with your current approach in musl, and then some. The additional process state and shared-nothing, reset-to-defaults-then-adjust-as-necessary approach makes everything much easier to reason about. – zwol Aug 14 '14 at 17:30
  • @Zack: "Start with no open file descriptors" is incompatible with POSIX, which allows an implementation to require certain implementation-internal file descriptors be preserved in order to maintain conforming behavior. If you don't care about that, though, you can just close them all yourself, but I think that aspect of the design would be better removed and normal fd inheritance preserved. – R.. GitHub STOP HELPING ICE Aug 14 '14 at 17:52
  • @R.. I can think of uses for "implementation-internal file descriptors", but I do not see any scenario where they need to remain open across `execve` (after which libc has to reinitialize itself anyway). And basically every program I've ever written that spawns subprocesses has wanted to be able to whitelist, rather than blacklist, the set of file descriptors inherited. So, no, I consider the "egg starts with no open file descriptors" aspect of the design an essential feature. – zwol Aug 14 '14 at 17:56
  • @Zack: You can view the rationale here: http://austingroupbugs.net/view.php?id=149 – R.. GitHub STOP HELPING ICE Aug 14 '14 at 19:57
  • > Linux's lack of interest in picking up `closefrom`. 8 years later, Linux now has `close_range`! – Nathan Ringo Jul 30 '22 at 14:10