1

My server has gone down repeatedly over the last 24-48 hours. CPU usage spikes from 15% to 100%, the server becomes unusable, and all my sites go down as a result.

Any tips on how I could track what is happening on my server?

Any suggestions for software that could stop the CPU from staying maxed out indefinitely, and maybe force an automatic reboot of the box?


A pointer on what to do would be very useful and much appreciated. :)

RadiantHex

4 Answers

2

You should try to investigate what the problem was. Before you reboot, check /var/log/messages and the other logs around the time of the spike.

Then you can try setting up something like VirtualBox for testing purposes and run your main servers inside it. This will reduce performance but add some stability, and you could still access the machine.

Also check for automatic updates. They could be eating your CPU.

MealstroM
2

Install Munin. Also, don't be afraid to sniff traffic.

SoMoSparky
2

For monitoring, you could try monit. It should be able to restart a runaway server if you put it under monit's control.

As a quick-and-dirty solution, you could put something like

date >> /var/log/cpu_hogs && ps -eo pcpu,pid,user,args | sort -rn -k1 | head -5 >> /var/log/cpu_hogs

into cron to run every 5 minutes or so. After a crash, have a look at what was eating your CPU just before the server went down.
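
If the one-liner gets unwieldy inside a crontab, the same idea can be wrapped in a small script. This is only a sketch: the function name `log_cpu_hogs`, the script path in the comment, and the default of five entries are assumptions for illustration, not something from a package.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamp and the busiest processes to a log.
# Usage: log_cpu_hogs [log_file] [count]
log_cpu_hogs() {
    log_file="${1:-/var/log/cpu_hogs}"   # same log file as the one-liner
    count="${2:-5}"                      # how many lines of ps output to keep
    {
        date
        # Numeric reverse sort on the %CPU column; the header line sorts as 0
        # and so normally falls outside the top entries.
        ps -eo pcpu,pid,user,args | sort -rn -k1 | head -n "$count"
    } >> "$log_file"
}

# Example crontab entry (assumed path), running a script that sources this
# file and calls log_cpu_hogs every 5 minutes:
# */5 * * * * /usr/local/bin/cpu_hogs.sh
```

Keeping the timestamp and the process list in one redirected group means each sample is appended as one block, which makes the log easier to read after a crash.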

Paweł Brodacki
1

You should turn on Linux Process Accounting if you want a more detailed historical view of what was using CPU and other resources, at the process level and the user level, than /var/log/messages et al. normally provide.

As for automated reboots when the server becomes unresponsive, what you'll want to look into is called watchdog (see the Ubuntu man page).
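
As a rough illustration of what process accounting gives you once it is enabled, the sketch below prints a few summaries with the `sa` and `lastcomm` tools. It assumes the GNU accounting utilities are installed (packaged as `acct` or `psacct`, depending on the distribution) and that accounting has already been switched on. The function name `acct_report` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print process-accounting summaries if the tools exist.
acct_report() {
    if ! command -v sa >/dev/null 2>&1 || ! command -v lastcomm >/dev/null 2>&1; then
        echo "process accounting tools not found (try package acct or psacct)" >&2
        return 1
    fi
    sa -m                  # per-user totals: process count and CPU time
    sa | head -n 20        # per-command summary, first 20 lines
    lastcomm | head -n 20  # first 20 lines of the executed-command listing
}
```

Run it as root after an incident and compare the per-command CPU totals with what you expect your sites to use.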

rthomson