
I have a server running CentOS Linux, and very rarely (maybe once every 3 months) something happens that drives the CPU load exceptionally high (around 400%), to the point where the server basically freezes up.

The problem is that once I reboot the server, I can't figure out what caused the spike. I tried setting up a cron job to periodically dump the top 10 CPU-consuming processes to a log file, but when the CPU load is that high the cron job apparently won't run either.
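
For reference, the kind of crontab entry I mean is roughly this (the log path and exact `ps` options are just an example, not my actual line):

```sh
# dump a timestamp plus the ten busiest processes every 10 minutes
*/10 * * * * (date; ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 11) >> /var/log/cpu-hogs.log 2>&1
```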

I'm sort of new to running a server, so I'm hoping you guys might have some advice on how I could better log the processes and figure out what's causing the sudden spike the next time it happens. I'm sure it's just a script or process that goes out of control, but until I can figure out which one it is I'm sort of at a loss...

Thanks for any help you can provide!

3 Answers


Not strictly speaking an answer to your question, but check out monit. You can configure it to monitor all kinds of things, including global system stats. For example, if CPU usage stays over 97% for 3 minutes, my servers reboot; if Apache uses more than 80% CPU for 5 minutes, it gets restarted, and so on. It's an incredibly useful piece of software and has me sleeping much, much easier at night. :-)
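
Roughly what that looks like in monit's configuration — a sketch rather than my exact rules; the pidfile path, init scripts, and config location are assumptions for a typical CentOS/Apache setup:

```sh
# Append a monit fragment. With monit's default 60-second poll interval,
# "3 cycles" is about 3 minutes and "5 cycles" about 5 minutes.
cat > /etc/monit.d/cpu-watch <<'EOF'
check system myserver
    # reboot if overall CPU usage stays above 97% for ~3 minutes
    if cpu usage > 97% for 3 cycles then exec "/sbin/shutdown -r now"

check process httpd with pidfile /var/run/httpd.pid
    start program = "/etc/init.d/httpd start"
    stop program  = "/etc/init.d/httpd stop"
    # restart Apache if it hogs more than 80% CPU for ~5 minutes
    if cpu > 80% for 5 cycles then restart
EOF
monit reload
```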

chmac
  • Any insight into why the downvote? Is it bad form to post useful info that's not strictly an answer to the question? Was there an issue with the accuracy or validity of what I posted? – chmac Dec 27 '14 at 10:29
  • Probably because it has nothing to do with diagnosing load spikes and recommends rebooting the machine to solve load problems. In other words, "[n]ot...an answer to your question." – Eric Mar 23 '21 at 03:02

How often did you run that logging cron job? Maybe you should run it more often; CPU usage doesn't peak instantly, so you should see an increase building up somewhere. Alternatively, you could use atop to monitor resource load (including CPU load) over time.
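
For example (package availability and paths are assumptions for CentOS; atop usually comes from a third-party repo such as EPEL):

```sh
# run atop as a service so it keeps binary history you can replay after a reboot
yum install atop
service atop start                     # samples every 10 minutes by default
# after the next incident, replay the recorded day:
atop -r /var/log/atop/atop_20100921    # 't' steps forward in time, 'C' sorts by CPU
```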

halp
  • The cronjob was originally running every 10 minutes, but I'll change it to every minute from now on so I can hopefully catch it. –  Sep 21 '10 at 00:35
  • The load average displayed by the `top` command shows 3 values: over the last minute, the last 5 minutes, and the last 15 minutes. So 10 minutes was certainly not the finest granularity you could use when trying to spot the CPU hog. – halp Sep 21 '10 at 20:52

It's possible that this isn't CPU-related at all. If you look at utilities like sar (from the sysstat package), you might be able to get more information about what was going on at the time of the system hang (CPU, disk I/O, memory, swapping, etc.).
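
Once the sysstat collector is running it records samples every 10 minutes by default, and those survive a reboot, so you can look back at the window around the hang. A sketch, assuming CentOS's default /var/log/sa layout:

```sh
yum install sysstat               # installs sar plus the /etc/cron.d/sysstat collector
sar -u -f /var/log/sa/sa21        # CPU utilization for the 21st of the month
sar -q -f /var/log/sa/sa21        # run queue length and load averages
sar -b -f /var/log/sa/sa21        # disk I/O and transfer rates
sar -r -f /var/log/sa/sa21        # memory utilization
sar -W -f /var/log/sa/sa21        # swapping activity
```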

I do have a couple of questions:

After rebooting, do you see log entries from the period during which the system was frozen?

How do you determine that the system is frozen?

Are you able to log in at all?

  • I don't see any log entries for the period it was frozen, and it's not actually frozen, just bogged down. I know it's a CPU issue because the server provider (Linode) has a graph showing resource usage, and the CPU is the one that spikes through the roof. If only they provided a list of per-process CPU usage :) I also can't log in at all, but that's because it times out, not because SSH is down... –  Sep 21 '10 at 00:37