0

At seemingly random intervals, the memory usage on our server climbs past the available RAM and the machine starts swapping until the CPU usage also hits 100%. It then starts killing off processes when it runs out of swap space, and we have to restart the server.

When this happens, our website and internal systems become unresponsive. I also cannot SSH into the server at that point, so I have no way of identifying the processes that are killing it.

I don't have a huge amount of experience with server admin, but I'm looking for ideas on how to detect the problem. Let me know what extra information you may need.

Starky
  • 103
  • 2

2 Answers

1

Could be a fork bomb, tbh (i.e. a process that keeps forking children and hence exhausts the resources). Could also be a memory-leak type of issue.
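If it is forking run amok, counting children per parent will usually show it up. A rough one-liner, assuming a procps-style ps (the PID at the end is just a placeholder):

ps -eo ppid= | sort -n | uniq -c | sort -rn | head   # number of children per parent PID
ps -fp <PID>                                         # then inspect a suspicious parent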

Identifying the offending process(es) is key here. Try this:

When you next restart the server, leave a console open as root and use renice to set its priority to -20. Once that's done, run top (it will inherit the -20 priority) and watch to see what's causing the issue.

These commands ought to do it:

sudo bash              # open a root shell
renice -n -20 -u root  # give root's processes the highest priority
top                    # watch for the runaway process

When things start looking tight, resort to the killall command, or kill the parent and then any leftover zombies.
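For example (the process name and PIDs here are only placeholders):

killall -9 runaway-name        # kill everything with that name
ps -o pgid= -p <parent PID>    # or look up the parent's process group ID
kill -9 -- -<PGID>             # and kill the whole group in one go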

At -20 you should be able to keep an active connection over SSH and still do your work; it's the same priority as the kernel.

Don't forget to look in the logs as well (the web server's and the others in /var/log), since they can be quite revealing.
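The kernel also logs a line whenever its out-of-memory killer fires, so something like the following (log file names vary by distro) will tell you whether that's what killed your processes:

dmesg | grep -i "out of memory"
grep -i "out of memory" /var/log/syslog /var/log/messages 2>/dev/null
ls -lt /var/log | head         # which logs were written to most recently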

If you identify the problem, let us know what it is and whether you need any further help.

Good luck.

See the renice and top man pages.

tiredone
  • 63
  • 1
  • 6
  • I got the following message when running renice: "renice: -20: bad value", followed by "0: old priority -8, new priority 0". – Starky Apr 10 '13 at 11:00
  • Hmm. Try this renice command instead: renice -20 -u root. – tiredone Apr 10 '13 at 11:13
  • If you still can't get to -20, -8 should be okay, though ideally you'd be as high as the kernel itself. – tiredone Apr 10 '13 at 11:16
  • That second command did the trick, thanks. Will wait for the next incident then and see if I can still get a live view from top. If so I'll mark the answer as accepted. – Starky Apr 10 '13 at 12:34
0

Install sysstat (and read the documentation carefully!), configure it, and analyze the collected data after such an incident.
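On a Debian-style system, for instance (file locations differ between distros, so treat the paths below as an example), turning on collection and reviewing the data looks roughly like this:

sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo service sysstat restart

sar -r                             # memory utilisation through the day
sar -S                             # swap usage
sar -q                             # run queue length and load averages
sar -q -f /var/log/sysstat/sa10    # the same report for the 10th of the month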

Review the security policies in place (is SELinux active, what are the ulimit settings for the various users, ...). Check that everything is up to date (a malfunctioning program can certainly cause this).
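As a concrete example of such a limit, pam_limits can cap processes and memory per user, which also takes the sting out of a fork bomb (the user name and numbers below are purely illustrative):

ulimit -u                          # current max user processes
ulimit -v                          # current max virtual memory (in KiB)

# in /etc/security/limits.conf:
# webapp  hard  nproc  500         # at most 500 processes for user "webapp"
# webapp  hard  as     2000000     # ~2 GB of address space per process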

Check any homebrew systems for possible loops or other resource exhaustion. Read all logs, even those for databases and the like.

vonbrand
  • 1,149
  • 2
  • 8
  • 16
  • Thanks for the info - going to try monitoring top first and if that doesn't help we'll take a more detailed look at the logs. I've only reviewed the apache access log so far, I should look into the error log (too big right now) and mysql/php in the future. – Starky Apr 10 '13 at 12:35