Implementing posix_spawn on Linux

Question

I am curious to see if it would be possible to implement posix_spawn in Linux using a combination of vfork+exec. In a very simplified way (leaving out most optional arguments) this could look more or less like this:

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

int my_posix_spawn(pid_t *ppid, char **argv, char **env)
{
    pid_t pid;

    pid = vfork();
    if (pid == -1)
        return errno;

    if (pid == 0)
    {
        /* Child */
        execve(argv[0], argv, env);

        /* If we got here, execve failed. How to communicate this to
         * the parent? */
        _exit(-1);
    }

    /* Parent */
    if (ppid != NULL)
        *ppid = pid;

    return 0;
}

However, I am wondering how to cope with the case where vfork succeeds (so the child process is created) but the exec call fails. There seems to be no way to communicate this to the parent, which would only see that it had apparently created a child process successfully (as it would get a valid pid back).

Any ideas?

The way to communicate such an error is exit with status 127 as specified in the documentation. — n. m. could be an AI, Aug 14 '14 at 14:52
Another option is to `open` `argv[0]` and verify that it's executable using `fstat` before the `fork`, then `fexecve` it. This would preclude many of the cases that could cause `execve` to fail -- though there are still other cases that you need to handle. — user3553031, Aug 14 '14 at 15:01
@n.m. But does that mean that there is no way to do this without calling waitpid or equivalent on the child? — Grodriguez, Aug 14 '14 at 15:01
@user3553031 that is just one of the many things that may go wrong.. — Grodriguez, Aug 14 '14 at 15:02
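
For illustration, the exit-status-127 convention mentioned in the comments could be checked on the parent side roughly as follows. This is only a sketch: the helper name `child_failed_to_exec` is made up for this example, and it assumes the child calls `_exit(127)` when exec fails (the snippet in the question uses `_exit(-1)`, which the parent would see as status 255).

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper: reap `pid` and classify how it ended.
 * Returns 1 if the child exited with status 127 (the conventional
 * "could not exec" status), 0 for any other termination, and -1 if
 * waitpid itself failed (errno is set). */
int child_failed_to_exec(pid_t pid)
{
    int status;

    /* Retry if a signal handler interrupts the wait. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }

    return WIFEXITED(status) && WEXITSTATUS(status) == 127;
}
```

Note that this blocks until the child terminates, and that 127 is ambiguous: a program that was executed successfully can also exit with 127 on its own.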

score 11 · Accepted Answer · edited May 23 '17 at 12:31

As others have noted in the comments, posix_spawn is permitted to create a child process that immediately dies due to exec failure or other post-fork failures; the calling application needs to be prepared for this. But of course it's preferable not to do so.

The general procedure for communicating exec failure to the parent is described in an answer I wrote on this question: What can cause exec to fail? What happens next?.
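
A widely used way to do this, which I am assuming is the kind of procedure meant here, is a close-on-exec pipe: the child writes errno into the pipe if execve fails, and the parent treats end-of-file as "exec succeeded". The sketch below uses plain fork rather than vfork so that the extra calls in the child are legal; the function name `spawn_report_exec_error` is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical name. Returns 0 and stores the child's pid on success,
 * or an errno value if pipe, fork or execve failed. */
int spawn_report_exec_error(pid_t *ppid, char *const argv[], char *const envp[])
{
    int fds[2];
    int child_errno = 0;
    int saved;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) == -1)
        return errno;

    /* Mark both ends close-on-exec so a successful execve closes them. */
    if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) == -1 ||
        fcntl(fds[1], F_SETFD, FD_CLOEXEC) == -1 ||
        (pid = fork()) == -1) {
        saved = errno;
        close(fds[0]);
        close(fds[1]);
        return saved;
    }

    if (pid == 0) {
        /* Child: only reaches the write if execve failed. */
        execve(argv[0], argv, envp);
        child_errno = errno;
        n = write(fds[1], &child_errno, sizeof child_errno);
        (void)n;    /* best effort; nothing useful to do on failure */
        _exit(127);
    }

    /* Parent: close our write end so read() sees EOF after the exec. */
    close(fds[1]);
    do {
        n = read(fds[0], &child_errno, sizeof child_errno);
    } while (n == -1 && errno == EINTR);
    close(fds[0]);

    if (n == (ssize_t)sizeof child_errno) {
        /* exec failed: reap the child so it does not linger as a zombie. */
        while (waitpid(pid, NULL, 0) == -1 && errno == EINTR)
            ;
        return child_errno;
    }

    /* EOF (or, for brevity, any other read result) is treated as success. */
    if (ppid != NULL)
        *ppid = pid;
    return 0;
}
```

With this, a call on a nonexistent path should return ENOENT to the caller instead of a pid. In a multithreaded program, `pipe2` with `O_CLOEXEC` would avoid the window between `pipe` and `fcntl`.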

Unfortunately, however, some of the operations you need to perform are not legal after vfork due to its nasty returns-twice semantics. I've covered this topic in the past in an article on ewontfix.com. The solution for making a posix_spawn that avoids duplicating the VM seems to be using clone with CLONE_VM (and possibly CLONE_VFORK) to get a new process that shares memory but doesn't run on the same stack. However, this still requires a lot of care to avoid making any calls to libc functions that might modify memory used by the parent. My current implementation is here:

http://git.musl-libc.org/cgit/musl/tree/src/process/posix_spawn.c?id=v1.1.4

and as you can see it's rather complicated. Reading the git history may be informative regarding some of the design decisions that were made.
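
To give a rough idea of the clone-based approach, here is a heavily simplified sketch. It is not the musl code: the names `clone_spawn`, `spawn_child`, `struct spawn_args` and the 64 KiB stack size are assumptions made for this example, it relies on glibc's `clone` wrapper, and it ignores signal handling, file actions and spawn attributes. Because of CLONE_VM the child can report errno simply by writing to memory the parent can see, and because of CLONE_VFORK the parent does not resume until the child has exec'd or exited.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)    /* arbitrary size for this sketch */

struct spawn_args {
    char *const *argv;
    char *const *envp;
    int err;            /* written by the child, read by the parent */
};

/* Runs in the child, on its own stack but in the parent's address space. */
static int spawn_child(void *p)
{
    struct spawn_args *args = p;

    execve(args->argv[0], args->argv, args->envp);
    args->err = errno;  /* visible to the parent because of CLONE_VM */
    _exit(127);
}

int clone_spawn(pid_t *ppid, char *const argv[], char *const envp[])
{
    struct spawn_args args = { argv, envp, 0 };
    char *stack;
    pid_t pid;
    int saved;

    stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return errno;

    /* Pass the top of the mapping, assuming a downward-growing stack.
     * SIGCHLD makes the child waitable with an ordinary waitpid. */
    pid = clone(spawn_child, stack + CHILD_STACK_SIZE,
                CLONE_VM | CLONE_VFORK | SIGCHLD, &args);
    saved = errno;

    /* The child has exec'd or exited by now, so its stack is unused. */
    munmap(stack, CHILD_STACK_SIZE);

    if (pid == -1)
        return saved;

    if (args.err != 0) {
        /* exec failed: reap the child and hand its errno to the caller. */
        while (waitpid(pid, NULL, 0) == -1 && errno == EINTR)
            ;
        return args.err;
    }

    if (ppid != NULL)
        *ppid = pid;
    return 0;
}
```

Even this toy version shows the hazard described above: the child shares everything with the parent, including errno, so every library call made in `spawn_child` has to be one that cannot corrupt the parent's state.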

Extremely insightful. This answers all my questions and then some more. — Grodriguez, Aug 14 '14 at 16:34

zwol · Answer 2 · 2014-08-14T17:31:07.540

0

I don't think there's any good way to do this with the current set of system calls. You've correctly identified the biggest problem -- the absence of any reliable way to report failure after the vfork. Other problems include race conditions in setting child state, and Linux's lack of interest in picking up closefrom.

Several years ago I sketched a new system-level API that would solve this problem: the key addition is a system call, which I called egg(), that creates a process without giving it an address space, and inheriting no state from the parent. Obviously, an egg process can't execute code; but you can (with a whole bunch more new system calls) set all of its kernelside state, and then (with yet another system call, hatch()) load an executable into it and set it going. Crucially, all of the new system calls report failure in the parent. For instance, there's a dup_into(pid, to_fd, from_fd) call that copies parent file descriptor from_fd to egg-state process pid's file descriptor to_fd; if it fails, the parent gets the failure code.

I never had time to flesh all of that out into a coherent API specification and code it up (and I'm not a kernel hacker, anyway) but I still think the concept has legs and I would be happy to work with someone to get it done.

edited Aug 14 '14 at 17:31

answered Aug 14 '14 at 16:17

zwol

Is there any reason this API couldn't be implemented in userspace as I described, marshalling the `dup_into` etc. calls into the `CLONE_VM` process with some sort of IPC? – R.. GitHub STOP HELPING ICE Aug 14 '14 at 17:16
@R.. Off the top of my head, the only thing I can think of that might make that approach *impossible* is that `clone` AFAICT offers no way to say "start this child with *no* open file descriptors". However, it would be subject to all the problems you have with your current approach in musl, and then some. The additional process state and shared-nothing, reset-to-defaults-then-adjust-as-necessary approach makes everything much easier to reason about. – zwol Aug 14 '14 at 17:30
@Zack: "Start with no open file descriptors" is incompatible with POSIX, which allows an implementation to require certain implementation-internal file descriptors be preserved in order to maintain conforming behavior. If you don't care about that, though, you can just close them all yourself, but I think that aspect of the design would be better removed and normal fd inheritance preserved. – R.. GitHub STOP HELPING ICE Aug 14 '14 at 17:52
@R.. I can think of uses for "implementation-internal file descriptors", but I do not see any scenario where they need to remain open across `execve` (after which libc has to reinitialize itself anyway). And basically every program I've ever written that spawns subprocesses has wanted to be able to whitelist, rather than blacklist, the set of file descriptors inherited. So, no, I consider the "egg starts with no open file descriptors" aspect of the design an essential feature. – zwol Aug 14 '14 at 17:56
@Zack: You can view the rationale here: http://austingroupbugs.net/view.php?id=149 – R.. GitHub STOP HELPING ICE Aug 14 '14 at 19:57
> Linux's lack of interest in picking up `closefrom`. 8 years later, Linux now has `close_range`! – Nathan Ringo Jul 30 '22 at 14:10
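
As a sketch of how `close_range` can stand in for `closefrom`: the helper below (the name `close_fds_from` is hypothetical) invokes the system call directly through `syscall`, on the assumption that the installed headers may or may not define `SYS_close_range`, and falls back to closing descriptors one by one when the call is unavailable or refused.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: close every descriptor numbered lowfd or higher. */
int close_fds_from(unsigned int lowfd)
{
    long fd, max;

#ifdef SYS_close_range
    /* Newer kernels: one system call closes the whole range. */
    if (syscall(SYS_close_range, lowfd, ~0U, 0U) == 0)
        return 0;
#endif

    /* Fallback: try each descriptor up to the soft limit in turn. */
    max = sysconf(_SC_OPEN_MAX);
    if (max < 0)
        max = 1024;     /* arbitrary guess when the limit is indeterminate */
    for (fd = lowfd; fd < max; fd++)
        close((int)fd);

    return 0;
}
```

A child would typically call something like `close_fds_from(3)` just before exec to keep only stdin, stdout and stderr. The fallback loop can be slow when the descriptor limit is very large.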
