
I usually have Nagios agents installed on all our Linux servers, so we get detailed reports of what's happening on them in real time, and we also have historical data.

However, there is one RHEL 7 server on which we can't install a Nagios agent (or monitor it over SSH, etc.), and on this server the load average shoots up once every few days. It is a web server, and we only find out when users complain that the site is loading slowly. In most cases, by the time we log in and check, the load is back to normal.

Is there any way, using readily available OS tools and logs, that I can find out what caused the load to shoot up?

I have gone through pretty much all the log files, including the Apache logs, but I can't find anything obvious in them.

Are there any tools or daemons that could give me more information about such incidents?

Debianuser

1 Answer


You could use Monit. It regularly checks (at an adjustable interval, e.g. every 2 or 5 minutes) a number of vital system parameters, and I believe the load-average check is even enabled by default.

When a parameter (here, the load average) crosses the configurable threshold, the default action is to send you a notification email. If that works for you, you can then log in via SSH and run top, htop, ps and other standard tools to get a quick, rough picture of what is happening.
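
As a minimal sketch of what the relevant monitrc directives could look like (the poll interval, thresholds and mail addresses below are placeholder values you would adjust for your server):

    # check every 2 minutes
    set daemon 120
    # deliver alert mails through the local MTA
    set mailserver localhost
    set alert root@localhost

    # $HOST expands to the server's hostname
    check system $HOST
        if loadavg (1min) > 8 then alert
        if loadavg (5min) > 4 then alert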

A second option is to configure Monit to execute a custom script instead of (or in addition to) sending the notification email. That script could do something as simple as top -b -n 1 >> /tmp/performancefindings.txt, and you would have a good starting point for investigating the high load averages. A sketch of such a script follows.
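
For example, you could replace the alert action with exec "/usr/local/bin/capture-load.sh" in the check above (script name and output path are just illustrative) and have the script append a snapshot of the busiest processes each time the threshold is crossed:

    #!/bin/bash
    # Hypothetical snapshot script run by Monit when the load threshold is crossed.
    # Appends a timestamped process snapshot for later review.
    {
        date
        uptime
        top -b -n 1 | head -40            # batch mode so top works without a tty
        ps aux --sort=-%cpu | head -20    # biggest CPU consumers
    } >> /tmp/performancefindings.txt

Looking at the accumulated snapshots the next time users report slowness should tell you which processes were responsible for the spike.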

Miloš Đakonović