
I find it fairly common for a Linux server to slow down to the point of complete unresponsiveness (load average of 150+, etc.). Looking at it afterwards with sar, munin, or similar tools shows a sudden, rapid increase in the number of processes. I generally have to reboot the machine at that point, but it always leaves me wondering what caused the problem in the first place.

I'm assuming a rogue process is going into some kind of loop, creating loads of new processes, which then eat up the RAM and cause the lockup. But how, after the event, can I determine which application/process is the offender?

Thanks

spoovy

1 Answer


Install atop and configure it to save a snapshot every 60 seconds. Then, when your system goes nuts again, you can reboot and use atop -r /var/log/atop.log to go back in time and see what went wrong.
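As a rough sketch of the setup on a Debian/Ubuntu-style system (package name, config path, and the INTERVAL variable vary by distro and atop version, so check your own system before copying this):

```shell
# Install atop (assumes a Debian/Ubuntu-style package manager)
sudo apt-get install atop

# Set the snapshot interval to 60 seconds; on older Debian/Ubuntu
# packages this is the INTERVAL variable in /etc/default/atop
sudo sed -i 's/^INTERVAL=.*/INTERVAL=60/' /etc/default/atop
sudo service atop restart

# After the next incident, replay the recorded log:
#   t / T  step forward / backward one sample
#   b      jump to a specific time (e.g. just before the lockup)
#   m      sort processes by memory usage
atop -r /var/log/atop.log
```

Newer atop packages write dated logs such as /var/log/atop/atop_YYYYMMDD instead of a single file, so point -r at whichever file covers the incident.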

Flup