I am currently managing some servers for a client running close to 40 websites, nearly half of which are WordPress sites. We are using four Linode VPS instances, with the sites distributed relatively evenly across them. The servers are running the latest version of CentOS and have about 1GB of RAM each.
We have been encountering recurring outages, but last night's outage was the oddest. The websites went down, so I logged in to Webmin and found that our web server, DB server, DNS server, etc. were all down. I started them back up and logged in via SSH, only to find that the server was crawling. Running top showed that nothing was hitting the server hard, and it did not look low on resources at all. Looking at the Linode graphs, everything was fine leading up to the outage (from what I could see), then there was a sharp drop-off in CPU%, IO, network activity, etc. Just before that, disk IO was pretty high because our nightly backups were being taken, but that was the only major activity.
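Since top only showed the state of the server after the fact, something like the following could be run from cron every minute to record memory, swap, and load ahead of the next outage. This is only a rough sketch: the function names and the log path are made up for illustration, and it assumes the standard Linux `/proc/meminfo` and `/proc/loadavg` files are available.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages as floats."""
    return tuple(float(x) for x in text.split()[:3])


def snapshot(meminfo_text, loadavg_text):
    """Build one record of free memory, swap in use, and 1-minute load."""
    mem = parse_meminfo(meminfo_text)
    load1, _load5, _load15 = parse_loadavg(loadavg_text)
    return {
        "load1": load1,
        "mem_free_kb": mem.get("MemFree", 0),
        "swap_used_kb": mem.get("SwapTotal", 0) - mem.get("SwapFree", 0),
    }


def append_snapshot(log_path):
    """Read the current values from /proc and append one line to log_path.

    log_path is a hypothetical location chosen by the caller.
    """
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    with open("/proc/loadavg") as f:
        loadavg_text = f.read()
    snap = snapshot(meminfo_text, loadavg_text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write("%s load1=%.2f mem_free_kb=%d swap_used_kb=%d\n" % (
            stamp, snap["load1"], snap["mem_free_kb"], snap["swap_used_kb"]))
```

Calling `append_snapshot()` once a minute with a log path of your choosing would leave a trail showing whether free memory or swap was changing while the nightly backups ran, which the after-the-fact view from top cannot show.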
I'm at a bit of a loss as to where to go from here. The client is very frustrated, and rightfully so.
What suggestions do you have to help troubleshoot and solve this?
Your help is greatly appreciated.