
On our cluster, nodes would sometimes go down when a new process requested too much memory. I was puzzled why the OOM killer did not just kill the guilty process.

The reason turned out to be that some processes get an oom_adj of -17, which makes them off-limits for the OOM killer (unkillable!).
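
(For reference, the value can be inspected and reset per process through /proc; PID 1234 below is just a placeholder.)

# Read the current value; -17 means the OOM killer will skip this process
cat /proc/1234/oom_adj
# As root, make the process eligible for the OOM killer again
echo 0 > /proc/1234/oom_adj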

I can clearly see that with the following script:

#!/bin/bash
# List every process whose oom_adj is not 0 (skipping /proc/self)
for i in `grep -v ':0$' /proc/*/oom_adj | awk -F/ '{print $3}' | grep -v self`; do
  ps -p $i | grep -v CMD
done

OK, it makes sense for sshd, udevd, and dhclient, but then I see regular user processes get -17 as well. Once such a user process causes an OOM event, it can never be killed, and the OOM killer goes insane: NFS rpc.statd, cron, everything that happens not to be at -17 gets wiped out instead. As a result the node goes down.

I have Debian 6.0 (Linux 2.6.32-3-amd64).

Does anyone know where to control the -17 oom_adj assignment behaviour?

Could launching sshd and Torque mom from /etc/rc.local be causing the overprotective behaviour?

Aleksandr Levchuk

2 Answers


The oom_adj value is inherited from the process that spawned it. If sshd is set to -17, then the Bash shell it spawns will be too. If you restart a service from that Bash shell, the value is propagated even further down the chain:

[i-180ae177] root@migrantgeek ~ # pgrep mysqld_safe
11395
[i-180ae177] root@migrantgeek ~ # cat /proc/11395/oom_adj 
0
[i-180ae177] root@migrantgeek ~ # for pid in `pgrep bash`; do echo -17 >  /proc/$pid/oom_adj; done
[i-180ae177] root@migrantgeek ~ # /etc/init.d/mysqld  restart
Stopping MySQL:                                            [  OK  ]
Starting MySQL:                                            [  OK  ]
[i-180ae177] root@migrantgeek ~ # pgrep mysqld_safe
11523
[i-180ae177] root@migrantgeek ~ # cat /proc/11523/oom_adj 
-17

Editing the init script to change the value at the end of the startup process should fix this.
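
For example, a minimal sketch of such an addition (assuming the daemon is mysqld_safe, as above; adjust the process name for your own service):

# Hypothetical lines appended to the end of the init script's start action:
# reset oom_adj to 0 for the freshly started daemon, undoing the -17 it
# inherited from the shell that launched it.
for pid in `pgrep mysqld_safe`; do
  echo 0 > /proc/$pid/oom_adj
done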

Jim
  • So -17 on init (pid 0) is a bad idea? – Aleksandr Levchuk May 21 '11 at 01:21
  • @Seth, I think you are right. I will test now and then mark this as the correct answer. – Aleksandr Levchuk May 21 '11 at 01:42
  • @Seth, I un-17'd all processes but still when I open a new ssh connection the sshd and the shell get -17 – Aleksandr Levchuk May 21 '11 at 02:13
  • Hmmm. Can you run the following? for pid in `pgrep ssh`; do cat /proc/$pid/oom_adj; done – Jim May 21 '11 at 04:27
  • I do not like this comment system :) The reason I ask is because SSH is probably still running as -17. I would ensure all OpenSSH processes are set to 0. – Jim May 21 '11 at 04:30
  • I don't think this is the best solution. SSH is good to leave at -17 in case you need access again. I would alter the init script to force 0 or another value on the process instead. – Jim May 21 '11 at 04:31
  • I set the sshd processes (verified with `pgrep sshd`) to oom_adj = 0. When I open a new ssh connection, the new processes become -17 again. I think it is hard-coded in sshd. – Aleksandr Levchuk May 21 '11 at 07:43
  • Or something else is spawning the Bash shell? Do a "ps faux" and see everything along the chain that spawns your shell. This may not really be needed though. As per the question it seems you really just want to set oom_adj for NFS rpc.statd, cron, etc. I would simply modify the init script and append a line at the bottom that sets it after startup each time. Then SSH and Bash can remain -17 which is good in case of a spike. – Jim May 21 '11 at 18:11
  • Also, you may want to increase swap space to prevent OOM Killer in the first place. It's good to adjust it but I think it's best to never have it come around in the first place. – Jim May 21 '11 at 18:12
  • @Seth, there is nothing between Bash and sshd in the parent-child chain. We occasionally get users who start a computational task that would take 500G of memory if it could. I don't even have that much HDD space on the compute nodes. – Aleksandr Levchuk May 21 '11 at 21:22
  • I'm very confident that -17 oom_adj is at the core of this issue. – Aleksandr Levchuk May 21 '11 at 21:23
  • There is a bug in sshd which causes oom_adj to be -17 for all child sessions after performing a reload. This is the reason. The bug is fixed in the very latest upstream commit. – Matthew Ife Sep 23 '13 at 09:35

On our clusters we disable overcommit with sysctl:

vm.overcommit_ratio=60
vm.overcommit_memory=2

You should adjust the ratio depending on how much memory and swap you have.

Once overcommit is disabled, the kernel simply fails the allocation (malloc() returns NULL) for the process that is trying to allocate too much memory. This solved all our memory crashes on the cluster nodes.
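
For reference, a minimal way to apply this, keeping the ratio of 60 from above as a placeholder (pick a value that matches your RAM and swap):

# Append both settings to /etc/sysctl.conf (or a file under /etc/sysctl.d/)
# so they survive a reboot, then apply them right away:
#   vm.overcommit_memory=2  - never overcommit; allocations beyond the limit fail
#   vm.overcommit_ratio=60  - commit limit = swap + 60% of physical RAM
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=60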

Daniel
  • I would caution against using overcommit - on a mixed web app + DB (Python + PostgreSQL) system, with background processes also allocating quite a lot of RAM, the overcommit=2 setting with ratio=100 just changed the OOM killing behaviour into many fork() failures (no more COW forking i.e. vfork), and many Python memory allocation errors (MemoryError). There wasn't really enough RAM on this system, but due to the forking issue and the need for some apps to allocate more virtual memory than will ever be used, the cure was worse than the disease. The solution was to add more RAM and/or swap. – RichVel Apr 04 '13 at 14:03