0

We have a dedicated machine that mostly serves as a web server. It runs Plesk for several domains, our webservers, and the central Munin node, which connects to around 10 other machines running munin-node.

Today our server became unresponsive. Calls to any website or to the mail servers timed out. SSH also timed out, and users complained that they could no longer play.

I issued a hard reset via the provider dashboard, and after a while everything was back up. I then checked the syslog. Our monitoring services reported the first timeout at 11:36. The last entries in the syslog before that time are these two:

Jul  7 11:30:19 xxx CRON[7666]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul  7 11:30:30 xxx CRON[7671]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)

Could Munin somehow be at fault for the server becoming unresponsive? If so, how could we tackle the issue?

MadHatter
rewb0rn

2 Answers

0

There is no indication that Munin is at fault. You're just seeing the last log entries your server managed to write.

There are many reasons the server could have crashed or locked up. It would have been good to look at the console before hard resetting it. You'll have to dig deeper and keep an eye on things. The first thing I'd look into is out-of-memory problems, which can result in software that stops responding or gets killed off. Very high load is another candidate, and there are many more.
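One way to follow up on the out-of-memory suspicion is to search the logs from before the reset for OOM killer messages. The following is only a minimal sketch: it assumes the kernel messages end up in the syslog, and it matches the phrases "invoked oom-killer" and "Out of memory", which Linux kernels log when the OOM killer runs. The function name `find_oom_lines` is made up for this example.

```python
import re

# Phrases the Linux kernel logs when the OOM killer runs.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that mention the OOM killer, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

It could be called with an open log file, for example `find_oom_lines(open("/var/log/syslog", errors="replace"))`, where the path is an assumption about this server. An empty result does not rule out memory pressure, because a machine that locks up hard may not get the chance to write anything.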

If you had good software monitoring this server's resources, availability and so on, you'd have more to go on the next time this happens. I really recommend this.
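As a rough illustration of what such monitoring records, the sketch below reads the load average and available memory from `/proc` on a Linux host. It is not a replacement for a real monitoring tool; the function names and the returned dictionary keys are invented for this example. Running something like it from cron and appending the result to a file would at least leave a trail on disk up to the moment of a hang.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def snapshot():
    """Read the current load and available memory (Linux only)."""
    with open("/proc/loadavg") as handle:
        load = parse_loadavg(handle.read())
    with open("/proc/meminfo") as handle:
        mem = parse_meminfo(handle.read())
    # MemAvailable is missing on old kernels, so fall back to MemFree.
    available = mem.get("MemAvailable", mem.get("MemFree"))
    return {"load1": load[0], "mem_available_kb": available}
```

The parsing is kept separate from the file reads so it can be checked against sample text without touching the live system.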

Ryan Babchishin
-2

According to this Munin page, your last entry corresponds to a plugin run. The plugin is in charge of checking the status of apt package updates on your monitored servers.

I would disable the plugin for a few days and see how it goes. Since this is a bare-metal server, a hard drive SMART check is also in order, followed by a RAM test.

A RAM test requires a reboot and an outage; a SMART disk check is non-disruptive.
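The SMART health check can be scripted. This is a hedged sketch that assumes smartmontools is installed, that the script runs as root, and that the disk is an ATA/SATA drive at `/dev/sda` (the device path is a guess for this server). It relies on the verdict line that `smartctl -H` prints for ATA drives; SAS and NVMe drives report their health in a different format, which this sketch does not handle. The function names are made up for the example.

```python
import subprocess


def smart_health_passed(smartctl_output):
    """Return True or False for an ATA health verdict, or None if no verdict line is found."""
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.split(":", 1)[1].strip() == "PASSED"
    return None


def check_disk(device="/dev/sda"):
    """Run `smartctl -H` on the device and interpret its output."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return smart_health_passed(result.stdout)
```

A passing verdict only means the drive does not consider itself failing. The attribute table from `smartctl -A` and the drive's self-test log are still worth reading by hand.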

kamihack