4

The output of ps aux contains the following:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ubuntu    1496  9.1  0.0      0     0 pts/0    Z+   19:47   0:30 [python] <defunct>
ubuntu    1501 14.6  0.0      0     0 pts/0    Z+   19:47   0:48 [python] <defunct>
ubuntu    1502 14.8  0.0      0     0 pts/0    Z+   19:47   0:48 [python] <defunct>
ubuntu    1503 15.1  0.0      0     0 pts/0    Z+   19:47   0:49 [python] <defunct>
ubuntu    1504 15.4  0.0      0     0 pts/0    Z+   19:47   0:50 [python] <defunct>
ubuntu    1505 15.8  0.0      0     0 pts/0    Z+   19:47   0:52 [python] <defunct>
ubuntu    1506 16.0  0.0      0     0 pts/0    Z+   19:47   0:53 [python] <defunct>
ubuntu    1507 14.1  0.0      0     0 pts/0    Z+   19:47   0:46 [python] <defunct>
ubuntu    1508 14.3  0.0      0     0 pts/0    Z+   19:47   0:47 [python] <defunct>
ubuntu    1509 14.4  0.0      0     0 pts/0    Z+   19:47   0:47 [python] <defunct>
ubuntu    1510 14.6  0.0      0     0 pts/0    Z+   19:47   0:48 [python] <defunct>
ubuntu    1511 14.9  0.0      0     0 pts/0    Z+   19:47   0:49 [python] <defunct>
ubuntu    1512 10.7  0.0      0     0 pts/0    Z+   19:47   0:35 [python] <defunct>
ubuntu    1513 71.3  0.0      0     0 pts/0    Z+   19:47   3:55 [python] <defunct>

These are a bunch of processes spawned via multiprocessing that have finished and are waiting to be joined by the parent. Why are they taking up CPU?
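For reference, a minimal sketch of the kind of setup that produces this (not my actual code): children that have exited but have not yet been joined stay <defunct>:

# Hypothetical sketch, not the original program: exited-but-unjoined
# multiprocessing children show up in `ps aux` as "Z+ ... [python] <defunct>".
import multiprocessing
import time

def work():
    time.sleep(1)          # child does its work and exits

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=work) for _ in range(14)]
    for p in procs:
        p.start()
    time.sleep(60)         # children have exited here, but are not yet joined,
                           # so they linger as zombies for the duration
    for p in procs:
        p.join()           # joining reaps them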

If this is just an artifact of ps, how can I get an accurate view of how much CPU is being used?

Zags
  • See my answer. Note that there's a difference between "showing accumulated CPU utilization" and "taking up CPU". Run ps a couple of times to see if your TIME increases. If it does, you may want to look deeper. – Mike S Jun 04 '21 at 23:51

3 Answers

4

A zombie process (i.e. one that is 'defunct') does not consume CPU: it is simply retained by the kernel so that the parent process can retrieve information about it (e.g. return status, resource usage, etc.).

The CPU usage indicated by the ps command corresponds to the CPU usage whilst the process was running: that is, before it terminated and became a zombie.
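For illustration, a minimal sketch (using plain fork/wait4 rather than the asker's multiprocessing code, which isn't shown): reaping the child clears its <defunct> entry and hands the parent exactly that retained information:

# Sketch: waiting on an exited child removes its process-table entry and
# returns the exit status and resource usage the kernel was holding for it.
import os, time

pid = os.fork()
if pid == 0:                 # child: burn a little CPU, then exit
    sum(range(10**6))
    os._exit(7)

time.sleep(2)                # child has exited; until we wait, ps shows it as <defunct>
_, status, rusage = os.wait4(pid, 0)   # reaping the zombie
print("exit code:", os.WEXITSTATUS(status))
print("user/sys CPU while it ran:", rusage.ru_utime, rusage.ru_stime)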

isedev
  • I'm currently looking at a defunct Linux process whose CPU usage time keeps increasing. So it can happen, but it is probably because of a driver or kernel bug. – Mattias Wadman Jan 25 '16 at 14:25
1

Those are zombie processes, as indicated by the Z in the STAT column - they won't be cleaned up until their parent process waits on them (or the parent itself terminates and init reaps them). I don't know much about Python, but presumably you called fork or similar within your Python interpreter to spawn them. Kill the interpreter and the zombies will be reaped (cleaned up).

Try the "top" command if you want up to date info on CPU.

Also, as an aside, I prefer the output from "ps -ef" rather than "ps aux". aux always struck me as a nonstandard hack (hence the lack of a '-' separating the command from its arguments), and it also fails to work on a lot of other Unix systems like HP-UX, AIX, etc.

"ps -ef" shows ppid (parent pid) which helps you track down problems like this.

Matt
  • ps has the advantage that its output can be piped to `grep` to look for various patterns (like "python"). How would one do this with top? – Zags Jul 21 '14 at 01:43
  • Yes, but as the comment above says, the output from ps is not accurate with respect to CPU usage (i.e. it shows usage from when the process was running - up to the point it was zombified - not current usage). top is a simple way to get a more 'accurate' indication of CPU usage. You can query the various /proc entries (which top does internally anyway), something like "cat /proc/stat" (or /proc/<pid>/stat for individual processes). This can be piped to grep, but you probably have to do some math to calculate the "%" - I'm not on a Linux box to test currently. – Matt Aug 01 '14 at 03:03
  • @Matt the `ps` in my answer seemed accurate wrt CPU usage. It was certainly increasing in CPU usage. I believe it just gleans the data from /proc, same as top. Try running `strace -f -e open,openat ps`. – Mike S Jun 04 '21 at 23:47
1

Interestingly, and perhaps confusingly, I have a zombie process at this moment which is accumulating CPU time on my system. So the question is, why?

Common wisdom is that any output from ps showing a zombie process means that the only thing in use is the process table entry. From Wikipedia: "...zombie process or defunct process is a process that has completed execution (via the exit system call) but still has an entry in the process table: it is a process in the 'Terminated state'." And from unix.stackexchange (https://unix.stackexchange.com/questions/11172/how-can-i-kill-a-defunct-process-whose-parent-is-init): "Zombie processes take up almost no resources so there is no performance cost in letting them linger."

So I have a zombie process:

# ps -e -o pid,ppid,stat,comm| grep Z
 7296     1 Zl   myproc <defunct>

Which appears to be using CPU time:

# ps -e -o pid,ppid,bsdtime,stat,comm| grep Z; sleep 10; ps -e -o pid,ppid,bsdtime,stat,comm | grep Z
 7296     1  56:00 Zl   myproc <defunct>
 7296     1  56:04 Zl   myproc <defunct>

So how can a Zombie process accumulate CPU time?

I changed my search:

# ps -eT -o pid,lwp,ppid,bsdtime,stat,comm| grep 7296 
 7296  7296     1   1:29 Zl   myproc <defunct>
 7296  8009     1  56:11 Dl   myproc

and I see that I have a thread that is running, and using system i/o. Indeed, if I do this, I can see field 15 (stime) changing:

# watch -d -n 1 cat /proc/8009/stat
Every 1.0s: cat /proc/8009/stat                  Fri Jun  4 11:19:55 2021

8009 (myproc) D 1 7295 7295 0 -1 516 18156428 12281 37 0 11609 344755

(trimmed at field 15)
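The same polling can be scripted; a small sketch (the pid and tid are the ones from the ps -eT output above, hard-coded for illustration) that reads the busy thread's stime - field 15 of its stat file, in clock ticks - once a second:

# Sketch: poor man's `watch` for one thread's stime, read from
# /proc/<pid>/task/<tid>/stat (field 15, counted in clock ticks).
import time

pid, tid = 7296, 8009

def stime_ticks():
    with open(f"/proc/{pid}/task/{tid}/stat") as f:
        return int(f.read().rpartition(")")[2].split()[12])

for _ in range(5):
    print("stime ticks:", stime_ticks())
    time.sleep(1)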

So I attempt to kill process 8009 with a TERM... didn't work. Killing it with a KILL is fruitless as well.

Sounds like a kernel bug to me. I did try to strace it, which was foolish because now my strace won't exit.

This is on RHEL 7.7 with kernel 3.10.0-1062. Old at this time, but young enough to conclude (in my mind) that a Zombie process could accumulate system resources due to a bug somewhere.

By the way, according to iotop our i/o was peaking at 4 GB/s, which is a lot. I think this thing is definitely having an impact on our system and I want to reboot.

ls output of /proc/8009 returns this:

# ls -l /proc/8009
ls: cannot read symbolic link /proc/8009/cwd: No such file or directory
ls: cannot read symbolic link /proc/8009/root: No such file or directory
ls: cannot read symbolic link /proc/8009/exe: No such file or directory

(normal /proc/pid output follows... but I trimmed it)

/proc/8009/fd is empty. So even though I have a significant amount of i/o taking place, it's not writing to any files. I don't see filesystem space getting used, as shown by df -h output.

Finally: trying to reboot is proving impossible. shutdown -r now is not working. There are a couple of systemd processes that are stuck in i/o wait:

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
22725 root       20   0  129M  2512  1548 R  0.0  0.0  0:00.19 htop
22227 root       20   0  195M  4776  2652 D  0.0  0.0  0:00.00 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
    1 root       20   0  195M  4776  2652 D  0.0  0.0  0:58.41 /usr/lib/systemd/systemd --switched-root --system --deserialize 22

Here's shutdown output. I'd say init is quite confused at this point:

# shutdown -r now
Failed to open /dev/initctl: No such device or address
Failed to talk to init daemon.

reboot says the same thing. I'm gonna have to pull the plug on this machine.

...Update: Just as I logged into the console, the system rebooted! It probably took about 10 minutes. So I don't know what systemd was doing but it was doing something.

...Another update: So I have 3 machines that this happened to today, all sharing the same characteristics: same binary, same sort of behavior (no open file descriptors, but i/o taking place; two threads; the child thread is accumulating CPU time). As @Stephane Chazelas mentioned, I performed a stack trace. Here's a typical output; I'm not very kernel-savvy, but perhaps it's of interest to some interloper in the future... note that 242603 is the parent thread, 242919 is the child that's busy:

# grep -H . /proc/242919/task/*/stack
/proc/242919/task/242603/stack:[<ffffffff898a131e>] do_exit+0x6ce/0xa50
/proc/242919/task/242603/stack:[<ffffffff898a171f>] do_group_exit+0x3f/0xa0
/proc/242919/task/242603/stack:[<ffffffff898b252e>] get_signal_to_deliver+0x1ce/0x5e0
/proc/242919/task/242603/stack:[<ffffffff8982c527>] do_signal+0x57/0x6f0
/proc/242919/task/242603/stack:[<ffffffff8982cc32>] do_notify_resume+0x72/0xc0
/proc/242919/task/242603/stack:[<ffffffff89f8c23b>] int_signal+0x12/0x17
/proc/242919/task/242603/stack:[<ffffffffffffffff>] 0xffffffffffffffff
/proc/242919/task/242919/stack:[<ffffffffc09cbb03>] ext4_mb_new_blocks+0x653/0xa20 [ext4]
/proc/242919/task/242919/stack:[<ffffffffc09c0a36>] ext4_ext_map_blocks+0x4a6/0xf60 [ext4]
/proc/242919/task/242919/stack:[<ffffffffc098fcf5>] ext4_map_blocks+0x155/0x6e0 [ext4]
/proc/242919/task/242919/stack:[<ffffffffc0993cfa>] ext4_writepages+0x6da/0xcf0 [ext4]
/proc/242919/task/242919/stack:[<ffffffff899c8d31>] do_writepages+0x21/0x50
/proc/242919/task/242919/stack:[<ffffffff899bd4b5>] __filemap_fdatawrite_range+0x65/0x80
/proc/242919/task/242919/stack:[<ffffffff899bd59c>] filemap_flush+0x1c/0x20
/proc/242919/task/242919/stack:[<ffffffffc099116c>] ext4_alloc_da_blocks+0x2c/0x70 [ext4]
/proc/242919/task/242919/stack:[<ffffffffc098a4d9>] ext4_release_file+0x79/0xc0 [ext4]
/proc/242919/task/242919/stack:[<ffffffff89a4a9cc>] __fput+0xec/0x260
/proc/242919/task/242919/stack:[<ffffffff89a4ac2e>] ____fput+0xe/0x10
/proc/242919/task/242919/stack:[<ffffffff898c1c0b>] task_work_run+0xbb/0xe0
/proc/242919/task/242919/stack:[<ffffffff898a0f24>] do_exit+0x2d4/0xa50
/proc/242919/task/242919/stack:[<ffffffff898a171f>] do_group_exit+0x3f/0xa0
/proc/242919/task/242919/stack:[<ffffffff898b252e>] get_signal_to_deliver+0x1ce/0x5e0
/proc/242919/task/242919/stack:[<ffffffff8982c527>] do_signal+0x57/0x6f0
/proc/242919/task/242919/stack:[<ffffffff8982cc32>] do_notify_resume+0x72/0xc0
/proc/242919/task/242919/stack:[<ffffffff89f8256c>] retint_signal+0x48/0x8c
/proc/242919/task/242919/stack:[<ffffffffffffffff>] 0xffffffffffffffff
Mike S