
I have a process that's monitored by its parent. The child encountered an error that caused it to call abort. The child does not interfere with abort handling in any way, so termination should proceed as expected (dump core, terminate). The parent is supposed to detect the child's termination and trigger a series of events to respond to the failure. The child is multi-threaded and complex.

Here's what I see from ps:

F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
0  1000  4929  1272  20   0  85440  6792 wait   S+   pts/2      0:00 rxd
1  1000  4930  4929  20   0      0     0 exit   Zl+  pts/2     38:21 [rxd] <defunct>

So the child (4930) has terminated. It is a zombie. I cannot attach to it, as expected. However, the parent (4929) stays blocked in:

int i;
// ...
waitpid (-1, &i, 0);

So it seems like the child is a zombie but somehow has not completed everything necessary for its parent to reap it. The WCHAN field of exit is, I think, a valuable clue.

The platform is 64-bit Linux, Ubuntu 13.04, kernel 3.8.0-30. The child doesn't appear to be dumping core or doing anything. I've left the system for several minutes and nothing changed.

Does anyone have any ideas what might be causing this or what I can do about it?

Update: Another interesting bit of information -- if I kill -9 the parent process, the child goes away. This is kind of baffling, since the parent process is trivial, just blocking in waitpid. Also, I don't get any core dump (from the child) when this problem happens.

Update: It seems the child is stuck in schedule, called from exit_mm, called from do_exit. I wonder why exit_mm would call schedule. And I wonder why killing the parent would unstick it.

David Schwartz
  • What does `ps -eo wchan,pid | grep 4930` give you? – hek2mgl Sep 27 '13 at 20:44
  • The `WCHAN` field is up there. The child is in `exit` and the parent is in `wait`. I believe the child is somehow stuck in the kernel's `exit` function, unable to complete the process of fully terminating. – David Schwartz Sep 27 '13 at 20:50
  • Oh yes, I see now. Do you use signal handlers in the child process? Could it be hanging on blocking I/O during exit? It would be nice to see the code of the child, but you said it is too complex. Is there no chance of breaking it down to simpler code? – hek2mgl Sep 27 '13 at 20:56
  • @hek2mgl I cannot replicate it under simple conditions. I suppose it's possible that it's hanging on I/O, but what I/O would the kernel be doing on `exit` after the mappings are gone? We do use signal handlers, but the call to `std::_exit` has taken place and the kernel has taken over the termination process (since the process is a zombie, we know it's not running any user-space code). – David Schwartz Sep 27 '13 at 21:13
  • Can you reproduce it with other kernels (other distributions, self-compiled, ...)? – hek2mgl Sep 27 '13 at 21:19
  • @hek2mgl I haven't kept track of where exactly it's happening, since this has been a fairly rare occurrence. I just pulled the information off one machine I know for sure it happened on while it was happening. I'll start keeping track when it happens again. – David Schwartz Sep 27 '13 at 21:22
  • Good luck! :) If you have news, it would be nice if you could drop a comment here. I like such problems. – hek2mgl Sep 27 '13 at 21:23
  • I just reproduced it with another kernel. This time I got some [useful output](http://pastebin.com/ZUFDaxNs). – David Schwartz Sep 28 '13 at 03:17
  • From what do you conclude the child called `_exit()`? – alk Sep 28 '13 at 11:21
  • @alk Actually, I think that was incorrect. The process is being killed by a fatal signal. – David Schwartz Sep 29 '13 at 00:07
  • It might be interesting to know where exactly the `wait()`ing parent got stuck. And by the way: does the zombie child have (or did it have) children of its own? – alk Sep 29 '13 at 09:26
  • @alk The parent is blocked in `waitpid` in user-space and `wait` in the kernel. The zombie child does have a child that shows up in `pstree` but no place else. I think it's another thread (since they both show up in the same process in `ps m`) that's also waiting in the kernel for the core dump to complete, but there is no core dump. – David Schwartz Sep 29 '13 at 13:36
  • Do you have a stack trace for the parent? – alk Sep 29 '13 at 13:40
  • @alk Yes. It's blocked in `waitpid (-1, &i, 0);` in user-space and in `wait` in the kernel. – David Schwartz Sep 29 '13 at 14:10

1 Answer


I finally figured it out! The process was actually doing useful work all this time. The process held the last reference to a large file on a slow filesystem. When the process terminates, the last reference to the file is released, forcing the OS to reclaim the space. The file was so large that this required tens of thousands of I/O operations, taking 10 minutes or more.

David Schwartz