I am porting a debugger, 'pi' ('process inspector'), to Linux and am working on the code that forks and execs a child in order to inspect it. I believe I am following the standard procedure, but the wait hangs. 'hang' is the procedure that does the work; its 'cmd' argument is the name of the binary (a.out) to trace:

int Hostfunc::hang(char *cmd){
    char *argv[10], *cp;
    int i;
    Localproc *p;
    struct exec exec;
    struct rlimit rlim;
    
    i = strlen(cmd);
    if (++i > sizeof(procbuffer)) {
        i = sizeof(procbuffer) - 1;
        procbuffer[i] = 0;
    }
    bcopy(cmd, procbuffer, i);
    argv[0] = cp = procbuffer;
    for(i = 1;;) {
        while(*cp && *cp != ' ')
            cp++;
        if (!*cp) {
            argv[i] = 0;
            break;
        } else {
            *cp++ = 0;
            while (*cp == ' ')
                cp++;
            if (*cp)
                argv[i++] = cp;
        }
    }
    hangpid = fork();
    if (!hangpid){
        int fd, nfiles = 20;
        if (getrlimit(RLIMIT_NOFILE, &rlim) == 0)   /* getrlimit returns 0 on success */
            nfiles = rlim.rlim_cur;
        for( fd = 0; fd < nfiles; ++fd )
            close(fd);
        open("/dev/null", 2);
        dup2(0, 1);
        dup2(0, 2);
        setpgid(0, 0);
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        execvp(argv[0], argv);
        _exit(127);   /* only reached if execvp failed; skip atexit handlers in the child */
    }
    if (hangpid < 0)
        return 0;
    p = new Localproc;
    if (!p) {
        kill(hangpid, SIGKILL);   /* kill takes (pid, sig) */
        return 0;
    }
    p->sigmsk = sigmaskinit();
    p->pid = hangpid;
    if (!procwait(p, 0)) {
        delete p;
        return 0;
    }
    if (p->state.state == UNIX_BREAKED)
        p->state.state = UNIX_HALTED;
    p->opencnt = 0;
    p->next = phead;
    phead = p;
    return hangpid;
}

I put the 'abort()' in to catch a non-zero return from ptrace, but that is not happening. Calling 'raise' seems to be common practice, but a cursory look at gdb's code shows it is not used there; in any case it makes no difference to the outcome. 'procwait' is as follows:

int Hostfunc::procwait(Localproc *p, int flag){
    int tstat;
    int cursig;

again:
    if (p->pid != waitpid(p->pid, &tstat, (flag&WAIT_POLL)? WNOHANG: 0))
        return 0;
    if (flag & WAIT_DISCARD)
        return 1;
    if (WIFSTOPPED(tstat)) {
        cursig = WSTOPSIG(tstat);
        if (cursig == SIGSTOP)
            p->state.state = UNIX_HALTED;
        else if (cursig == SIGTRAP)
            p->state.state = UNIX_BREAKED;
        else {
            if (p->state.state == UNIX_ACTIVE &&
                !(p->sigmsk&bit(cursig))) {
                ptrace(PTRACE_CONT, p->pid, 0, cursig);   /* four arguments; addr is ignored on Linux */
                goto again;
            }
            else {
                p->state.state = UNIX_PENDING;
                p->state.code = cursig;
            }
        }
    } else {
        p->state.state = UNIX_ERRORED;
        p->state.code = WEXITSTATUS(tstat) & 0xFFFF;
    }
    return 1;
}

The 'waitpid' in 'procwait' just hangs. If I run the program with the above code, and run a 'ps', I can see that 'pi' has forked but hasn't yet called exec, because the command line is still 'pi', and not the name of the binary I am forking. I discovered that if I remove the 'raise', 'pi' still hangs but 'ps' now shows that the forked program has the name of the binary being examined, which suggests it has performed the exec.

So, as far as I can see, I am following documented procedures to take control of a forked process but it isn't working.

Noel Hunt

N. Hunt
  • Can you turn this into a [mcve], that somebody could actually compile, run and test? – Nate Eldredge Jun 25 '20 at 05:13
  • I don't see a `raise()` in this version of the code. Note that `waitpid` will only return when the child process stops, and as the code stands, I don't see anything that would make it stop. When were you hoping it would stop? – Nate Eldredge Jun 25 '20 at 05:18
  • To answer the question about turning this into a minimal reproducible example: the problem is that very straightforward examples, such as can be found on the net, work. That is, the parent forks; the child calls 'ptrace(PTRACE_TRACEME,0,0)' and then execs something, say 'ls -l'. The parent, on the other hand, does a wait. When the wait returns, the parent is in control and can set registers, etc. – N. Hunt Jun 25 '20 at 05:24
  • I must have written up the version without 'raise()' but as I said, it doesn't change the outcome, except that with 'raise', the child's argv[] is the name of 'pi' itself, meaning that the child hasn't exec'd; if I remove 'raise', the wait still hangs, but ps shows that argv in the child now has the name of the program exec'd in the child. This suggests that the SIGSTOP is stopping the forked child from even getting to the exec, but I have seen 'raise' used in examples on the net in this context. Still, the problem is wait; why is it not returning? – N. Hunt Jun 25 '20 at 05:32
  • As to your comment about not seeing anything that would make it stop, my understanding is that the PTRACE_TRACEME actually causes the child to stop at the first exec. – N. Hunt Jun 25 '20 at 05:32
  • From the manual entry on ptrace: A process can initiate a trace by calling fork(2) and having the resulting child do a PTRACE_TRACEME, followed (typically) by an execve(2). Hmm, I am not using execve, but I don't see why that would matter. – N. Hunt Jun 25 '20 at 05:38
  • I see, you're right, it should stop on exec. I ask about the minimal reproducible example because I'm afraid the problem may be in the code that you didn't show. – Nate Eldredge Jun 25 '20 at 05:41
  • As it now stands, I have forked a child, and wait is hanging. I don't know a lot about all the files in /proc, but the 'wchan' file shows that the forked child is in 'ptrace_stop'. I expected the parent to be able to get information about a stopped child via wait, but it's hanging for some reason. – N. Hunt Jun 25 '20 at 05:57
  • I guess I'm not sure how to help you with code that I can't see and can't test. As you point out, when we write a test program that does what **you claim** your code is doing, it works. Ergo, I suspect your code is not doing what you claim it's doing, though I don't yet know why not. So I would suggest trying again to make a minimal example, but from the other direction: start with your non-working program, and remove or stub out anything that isn't related to the question. – Nate Eldredge Jun 25 '20 at 06:00
  • The problem is getting murkier. I have substituted a simple 'wait(0)' for the wait call and kept the return value ( int rv = wait(0); ) so that I can look at the code while running it under dbx (I am using Oracle Linux with their compilers and dbx; gdb would work too). I put a breakpoint at the wait call, then run the debugger. If I try to 'step' over the wait call in the debugger it hangs, as expected. I then send a SIGKILL to the child from another window and... the wait still hangs. So I am stumped. I get the feeling there is some nice little Linux feature that I need to turn on. – N. Hunt Jun 25 '20 at 06:15
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/216616/discussion-between-n-hunt-and-nate-eldredge). – N. Hunt Jun 25 '20 at 06:58
  • My apologies; there was a SIGCHLD handler interfering by restarting the wait system call. Thanks for your various comments. – N. Hunt Jun 25 '20 at 07:05
  • Ah, the usual story: the bug is always where you least expect it :-) Glad you got it solved. – Nate Eldredge Jun 25 '20 at 13:59
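
For reference, the "straightforward example" mentioned in the comments can be sketched roughly as follows. This is not pi's code: the helper name first_stop_signal and its out-parameter are made up for illustration, and it assumes a Linux system where the process is allowed to use ptrace and no SIGCHLD handler is installed. The child calls PTRACE_TRACEME and then execs; the parent's waitpid should return once the child stops at the exec, reporting SIGTRAP.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not part of pi): start argv[0] under ptrace and
// return the signal reported at its first stop, or -1 on failure.
// After a successful exec of a PTRACE_TRACEME child this is SIGTRAP.
int first_stop_signal(char *const argv[], pid_t *child_out) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        // Child: ask to be traced by the parent, then exec the target.
        if (ptrace(PTRACE_TRACEME, 0, nullptr, nullptr) < 0)
            _exit(126);
        execvp(argv[0], argv);
        _exit(127);                 // only reached if execvp failed
    }
    *child_out = pid;
    int status = 0;
    pid_t r;
    // Parent: wait for the exec stop, retrying if interrupted by a signal.
    do {
        r = waitpid(pid, &status, 0);
    } while (r < 0 && errno == EINTR);
    if (r != pid || !WIFSTOPPED(status))
        return -1;                  // child exited instead of stopping
    return WSTOPSIG(status);
}
```

Once this returns, the child is stopped and the caller is responsible for it: continue it with PTRACE_CONT, or kill and reap it.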

1 Answer

I have found the problem (in my own code, as Nate pointed out), but the cause was obscure until I ran 'strace pi'. The trace made it clear that there was a SIGCHLD handler, and that it was executing a wait. The parent enters wait, SIGCHLD is delivered, and the handler's wait reaps the child's status; the parent's wait is then restarted and hangs because there is no longer any pending change of state. The SIGCHLD handler makes sense because pi wants to be informed of state changes in the child. The first version of 'pi' I got working was the Solaris version, which uses /proc for process control, so there was no use of 'wait' to get child status; hence I didn't see this problem there.
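
One possible way to keep such a handler from stealing the status is to block SIGCHLD around the fork and the foreground wait. The sketch below is only an illustration of that idea, not what pi does: the names install_reaping_handler and spawn_and_wait_for_stop are invented, and a child that stops itself with raise(SIGSTOP) (observed via WUNTRACED) stands in for the ptrace stop so that the example does not depend on ptrace. With SIGCHLD blocked, the handler cannot run between the child's stop and the parent's waitpid, so the parent sees the status itself.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the debugger's SIGCHLD handler: it collects
// any available child status, which is what starved the foreground wait.
static void reaping_handler(int) {
    int saved = errno;
    int st;
    while (waitpid(-1, &st, WNOHANG | WUNTRACED) > 0) {
    }
    errno = saved;
}

// Install the handler above with SA_RESTART, so interrupted waits restart.
int install_reaping_handler() {
    struct sigaction sa = {};
    sa.sa_handler = reaping_handler;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, nullptr);
}

// Fork a child that stops itself, and wait for that stop with SIGCHLD
// blocked. Returns the stop signal, or -1 on failure.
int spawn_and_wait_for_stop(pid_t *child_out) {
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) < 0)
        return -1;

    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        sigprocmask(SIG_SETMASK, &old, nullptr);
        raise(SIGSTOP);             // stand-in for the ptrace exec stop
        _exit(0);
    }
    if (pid > 0) {
        int st = 0;
        pid_t r;
        do {
            r = waitpid(pid, &st, WUNTRACED);
        } while (r < 0 && errno == EINTR);
        if (r == pid && WIFSTOPPED(st))
            result = WSTOPSIG(st);
        *child_out = pid;
    }
    // The pending SIGCHLD is delivered here; the handler finds no new status.
    sigprocmask(SIG_SETMASK, &old, nullptr);
    return result;
}
```

Other designs are possible, for example having the handler only set a flag and leaving every waitpid call to the main code.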

N. Hunt