
I have a CentOS 5 instance running on Amazon EC2. Normal CPU usage hovers around 10-20%. About 4 times in the past week, however, CPU usage has suddenly shot up to 100% and stayed pinned there until I rebooted the instance.

I'm sure this is a bug or a misconfiguration with something on the server, but when the instance gets into this state, I can't log in via SSH to do any investigating. Unfortunately, Amazon doesn't provide a way for you to access the instance via a console.

So, I guess my question is -- is there a way to configure the machine such that in any 100% CPU situation, we give priority to SSH to allow root to log in and investigate?

Or at least, is there any easy way to automatically kill any process/processes when this sort of situation occurs?

By the way, this is a "c1.xlarge" instance on Amazon, which means it has 8 cores.

Also, if it helps, the machine is set up as a web server running Plesk. And don't tell me that Plesk can't be run within EC2, because I've been doing it just fine for months ... until recently. The machine is already running Plesk's version of monit, so I'd rather not set up a second monit.

4 Answers


You could try modifying the sshd init script to start it with a nice value of -5 or -10. That nice value is inherited by all SSH logins, which may be fine for you.
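
A sketch of what that could look like: on CentOS most init scripts start their service through the daemon function from /etc/init.d/functions, which takes an optional nice level as its first argument; if your /etc/init.d/sshd starts sshd directly, you can wrap the command in nice instead. The exact start line varies by release, so treat these as examples rather than the literal script contents:

# if the script uses the daemon helper, give it a nice level:
daemon -10 $SSHD $OPTIONS && success || failure
# if it calls sshd directly, wrap the command instead:
nice -n -10 $SSHD $OPTIONS && success || failure

# Or, without touching the init script at all, renice the running daemon
# (/var/run/sshd.pid is the usual CentOS pid file; adjust if yours differs):
renice -10 -p $(cat /var/run/sshd.pid)

Either way, new SSH sessions inherit the higher priority, so a root login should still get scheduled even when every core is pegged, as long as the box isn't completely out of memory.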

MikeyB

I know of no way to handle SSH logins specially in these cases, but you should check your cron jobs and log files. In particular, check the syslog, which records every cron job that starts, and see whether one of them correlates with your issue. That should let you identify the cause of the problem. It may even be a kernel bug, which would also show up in the syslog.
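
A minimal sketch of where to look, assuming the default CentOS log locations (cron activity in /var/log/cron, kernel and general messages in /var/log/messages):

# cron jobs that fired shortly before the CPU spike
grep CMD /var/log/cron | tail -n 50

# kernel-level trouble (oops, OOM killer, lockups, ...) around the same time
grep -iE 'oom|oops|panic|lockup' /var/log/messages | tail -n 50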

tex

An easy thing to do would be to log the CPU usage of all processes. Something like this:

top -b > top.log 2>&1 < /dev/null &

This will continually log the output of top to top.log (-b runs top in batch mode, which is what lets its output be redirected on Linux). The redirections are there because I've sometimes noticed problems with backgrounded jobs started in SSH sessions that don't have STDOUT, STDERR, and STDIN all tied to something.

Anyway, after your next reboot, you could just read the bottom of that log and see what processes are hammering the CPU.

The above will produce quite a lot of output. You could instead make it write out once every 5 seconds like this:

top -b -d 5 > top.log 2>&1 < /dev/null &
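
Then, after the next lockup and reboot, the tail of the log holds the last snapshots top managed to write before the machine became unreachable (the filename just follows the example above):

tail -n 200 top.log
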
molecularbear

It feels like there's a fork bomb somewhere exhausting the PID space. Linux reserves some process slots and memory for root, but taking advantage of that would require:

  • manually logging in from the console as root
  • kill -STOP on the misbehaving process (if you kill -9 it, some other process will re-fork and occupy the slot)

Since you can't log in, I suggest setting ulimits on memory use, number of processes, number of open files, and so forth; see the sketch below.
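
A sketch of what that could look like in /etc/security/limits.conf; the apache account name and the numbers here are assumptions (Plesk often runs sites under per-domain users), so adjust them to your setup:

# /etc/security/limits.conf -- hypothetical values, tune for your workload
# cap how many processes a single account may spawn, which blunts a fork bomb
apache    hard    nproc     256
apache    hard    nofile    4096
# catch-all for other non-root accounts
*         hard    nproc     512

Note that limits.conf is applied through PAM, so it covers login sessions; for daemons started from init scripts you may need an explicit ulimit -u in the service's init script or sysconfig file instead.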

lorenzog