I'm asking this on StackOverflow rather than SuperUser/ServerFault because it concerns the syscalls and OS interactions performed by sshd, not the problem I'm having using SSH (though help with that is appreciated, too :p).
Context:
I invoke a complex series of scripts via SSH, e.g. ssh user@host -- /my/command. The remote command does a lot of complex forking and exec'ing and eventually leaves a backgrounded daemon process running on the remote host. Occasionally (I'm slowly going mad trying to find reliable reproduction conditions), the ssh command never returns control to the client shell. When that happens, I can go onto the target host and see an sshd: user@notty process with no children, hanging indefinitely.
Fixing that issue is not what this question is about. This question is about what that sshd process is doing.
The SSH implementation is OpenSSH, and the version is 5.3p1-112.el6_7.
The problem:
If I find one of those stuck sshd processes and strace it, I can see it blocked in a select on two handles, e.g. the following (or similar; the call never returns, so strace never prints its closing parenthesis):
select(12, [3 6], [], NULL, NULL
lsof tells me that one of those handles is the TCP socket connecting back to the SSH client. The other is a pipe whose other end is open only in the same sshd process. If I search for that pipe by ID using the answer to this SuperUser question, the only process that references the pipe is that same process. lsof confirms this: both the read and write ends of the pipe are open in the same process, e.g. (for pipe 788422703 and sshd PID 22744):
sshd 22744 user 6r FIFO 0,8 0t0 788422703 pipe
sshd 22744 user 7w FIFO 0,8 0t0 788422703 pipe
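For readers less familiar with this fd layout: one well-known way for a single process to end up holding both ends of a pipe, with the read end in a select set next to a socket, is the "self-pipe trick". A SIGCHLD handler writes a byte into the pipe, so a select that is already blocked (or about to block) wakes up when a child exits. The Python sketch below only illustrates that general pattern. It is not taken from sshd, and socket.socketpair() is an assumed stand-in for the TCP connection so the example stays self-contained.

```python
import os
import select
import signal
import socket


def wait_with_self_pipe():
    """Block in select() on a "network" socket plus the read end of a pipe
    whose write end is only ever written by our own SIGCHLD handler."""
    notify_r, notify_w = os.pipe()        # both ends stay in this process
    os.set_blocking(notify_w, False)      # the handler must never block
    net, peer = socket.socketpair()       # stand-in for the TCP connection

    def on_sigchld(signum, frame):
        try:
            os.write(notify_w, b"\0")     # wake up the select() below
        except BlockingIOError:
            pass                          # pipe full: a wakeup is already pending

    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)                   # child exits immediately
        # Even if SIGCHLD arrives before we get here, the byte is already
        # sitting in the pipe, so this select() cannot miss the event.
        readable, _, _ = select.select([net, notify_r], [], [])
        woke_for_child = notify_r in readable
        if woke_for_child:
            os.read(notify_r, 64)         # drain pending notifications
        _, status = os.waitpid(pid, 0)
        return woke_for_child, os.WEXITSTATUS(status)
    finally:
        signal.signal(signal.SIGCHLD, previous)
        os.close(notify_r)
        os.close(notify_w)
        net.close()
        peer.close()


print(wait_with_self_pipe())  # (True, 0)
```

The pipe exists because a handler that only set a flag would leave a window: a SIGCHLD arriving between the flag check and the select call would be missed until some other fd became readable.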
Questions:
- What is sshd waiting for? If the pipe isn't connected to anything and there are no child processes, I can't imagine what event it could be expecting.
- What is that "looped" pipe, and what does it represent? My only theory is that if STDIN isn't supplied to the SSH client, the sshd on the target host opens a dummy STDIN pipe so that some of its internal child-management code can be more uniform. But that seems pretty tenuous.
- How does sshd get into this situation?
What I've Tried/Additional Info:
- Initially, I thought this was a handle leaked to a daemon. It's possible to create a waiting, child-less sshd process by issuing a command that backgrounds itself, e.g. ssh user@host -- 'sleep 60 &'; sshd will wait for the streams to the daemonized process to be closed, not just for its immediate child to exit. Since the scripts in question eventually start a daemon (way down the process tree), it initially seemed possible that the daemon was holding onto a handle. However, that doesn't seem to hold up: using the sleep 60 & command as an example, sshd processes communicating with daemons hold and select on four open pipes, not just two, and at least two of those pipes connect sshd to the daemon process rather than looping back. Unless there's a way of tracking/pointing to a pipe that I don't know about (and there likely is; for example, I have no idea how dup'ed filehandles play into close() semaphore waiting or piping), I don't think the pipe-to-self situation represents a waiting-on-daemon case.
- sshd periodically receives communication on the TCP socket/SSH connection itself, which wakes it out of the select for a brief period of communication (during which strace shows it blocking SIGCHLD); then it goes back to waiting on the same FDs.
- It's possible that I'm being affected by this race condition (SIGCHLD being delivered before the kernel makes data available in the pipe). However, that seems unlikely, both given the rate at which this condition manifests and because the processes being run on the target host are Perl scripts, and the Perl runtime closes and flushes open file descriptors on shutdown.
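The sleep 60 & behaviour from the first item comes down to ordinary pipe semantics. A reader only sees EOF once every copy of the write end is closed, and that includes copies inherited by a backgrounded grandchild. The following is a minimal Python sketch that assumes a plain pipe stands in for the session's stdout channel. It is not sshd's code, and the function name and delay are made up for illustration.

```python
import os
import time


def read_until_eof_with_background_grandchild(delay=0.2):
    """Mimic `sh -c 'sleep 60 &'`: the direct child exits at once, but a
    backgrounded grandchild keeps the inherited write end of the pipe open."""
    r, w = os.pipe()
    child = os.fork()
    if child == 0:
        os.close(r)
        if os.fork() == 0:
            # Grandchild ("daemon"): still holds w after its parent is gone.
            time.sleep(delay)
            os.write(w, b"grandchild done")
            os._exit(0)               # exiting closes the last write end
        os._exit(0)                   # child exits without waiting
    os.close(w)                       # parent keeps only the read end
    _, status = os.waitpid(child, 0)  # the direct child is reaped quickly...
    chunks = []
    while True:
        chunk = os.read(r, 4096)      # ...but EOF waits for the grandchild
        if not chunk:
            break
        chunks.append(chunk)
    os.close(r)
    return os.WEXITSTATUS(status), b"".join(chunks)


code, data = read_until_eof_with_background_grandchild()
print(code, data)  # 0 b'grandchild done'
```

The data arrives after the direct child has already been reaped, which is the same "no children, but still waiting on a stream" shape described above. In the real sleep 60 & case, however, the pipe's other end lives in the daemon, not in sshd itself.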