
I am currently managing servers for a client running close to 40 websites, nearly half of which are WordPress sites. We are using 4 Linode VPSes, with the sites distributed relatively evenly across them. Each server runs the latest version of CentOS and has about 1 GB of RAM.

We have been encountering recurring outages, but last night's was the oddest. The websites went down, so I logged in to Webmin and found that our web server, DB server, DNS server, etc. were all down. I started them back up and logged in via SSH, only to find that the server was crawling. Running `top` showed that nothing was hitting the server hard, and it did not look low on resources at all. Looking at the Linode graphs, everything was fine leading up to the outage (from what I could see), then there was a sharp drop-off in CPU%, IO, network activity, etc. Just before that, disk IO was fairly high because our nightly backups were being taken, but that was the only major activity.

I'm at a bit of a loss with where I should continue from here. The client is very frustrated and rightfully so.

What suggestions do you have to help troubleshoot and solve this?

Your help is greatly appreciated.

  • Hire a sysadmin. You sound like you're out of your depth. (Webmin is a dead giveaway! ;) – Tom O'Connor May 14 '13 at 23:21
  • Oh, and I'd probably move away from Linode and onto a proper webhost that actually gives a toss about IO isolation on VMs. – Tom O'Connor May 14 '13 at 23:22
  • Heh, thanks @TomO'Connor. Might be true, but I'm trying to learn as much as I can... and I guess being tossed in the deep end can do that. I'm quite certain that the web servers need to be tuned, which is something that I am learning about for the first time. Would you be able to recommend any sysadmin resources? – unknownperson May 15 '13 at 04:48

2 Answers


I would look at my logs and contact Linode for help.

  • Linode said that without agent-based monitoring, there would be no useful logs to troubleshoot this. Can you or anyone else recommend a good monitoring system? – unknownperson May 14 '13 at 22:12
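
Short of a full monitoring system, a stopgap for getting some historical memory data is to log a few fields from `/proc/meminfo` at regular intervals. The following is only a minimal sketch, not a replacement for proper monitoring: the function names are made up for illustration, and it assumes a Linux host where `/proc/meminfo` exposes the usual `MemFree`, `Buffers`, `Cached` and `SwapFree` fields.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines shaped like "Name:   12345 kB".
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def free_memory_kb(meminfo):
    """Rough estimate of memory usable without swapping: free + buffers + page cache."""
    return (
        meminfo.get("MemFree", 0)
        + meminfo.get("Buffers", 0)
        + meminfo.get("Cached", 0)
    )


def sample_line(path="/proc/meminfo"):
    """Build one timestamped log line from the current memory state."""
    with open(path) as handle:
        info = parse_meminfo(handle.read())
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} free_kb={free_memory_kb(info)} swap_free_kb={info.get('SwapFree', 0)}"
```

Calling `sample_line()` once a minute (from cron, say) and appending the result to a file would at least show whether free memory and swap were collapsing in the minutes before an outage.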

Did you look in the logs?
Maybe the server ran out of memory and the OOM killer terminated the processes. Quick check: run `dmesg`; OOM kills should be easy to spot in its output.
On a side note, I don't really get why you would run 4 VPSes with 1 GB each instead of a single VPS with 4 GB of RAM.
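
To make the `dmesg` check a bit more concrete, here is a minimal sketch that scans kernel log text for OOM-killer messages. The regular expression is an assumption: the exact wording of these messages differs between kernel versions, so treat the pattern as a starting point. The function names are hypothetical.

```python
import re
import subprocess

# Assumed message shape: "Killed process <pid> (<name>)", optionally with
# ", UID <n>," between the pid and the name. Wording varies by kernel version.
KILLED_RE = re.compile(r"Killed process (\d+)(?:, UID \d+,)? \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) for each OOM kill found in log_text."""
    kills = []
    for line in log_text.splitlines():
        match = KILLED_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


def read_dmesg():
    """Return the kernel ring buffer as text (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

Something like `find_oom_kills(read_dmesg())` then lists which processes were killed. The ring buffer is limited in size and is cleared on reboot, so older kills may only show up in the system log files.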

Sandor Marton
  • Thanks for the response @Sandor. I've never used dmesg before, but this is the only memory-related information I was able to find with the command: `Initializing HighMem for node 0 (0002d1fe:00040800) Memory: 1024556k/1056768k available (6807k kernel code, 23572k reserved, 1826k data, 420k init, 309256k highmem)`... then a breakdown. The reason we went with 4 VPSes rather than 1 big VPS was site separation. The client specifically requested that the websites reside on different VPSes in different physical locations. I suspect we are running out of memory as well. – unknownperson May 14 '13 at 23:10
  • I have checked the logs, but found little useful information. Some httpd processes were killed several times during the period leading up to the outage, which would reinforce the theory that we are running out of memory. Perhaps Apache tuning is necessary. – unknownperson May 14 '13 at 23:15
  • Or you could increase swap, though that will slow everything down once memory runs short. Either way, you need some kind of monitoring, both for alerts and for historical stats; without it, it is hard to tell what to tune. It may be that a short spike in visitors leads to a lot of Apache processes, which then exhaust memory. You may want to adjust MaxClients, or perhaps switch from Apache to nginx. – Sandor Marton May 14 '13 at 23:27
  • I think web server tuning is where I need to head at this point... – unknownperson May 15 '13 at 04:49
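
As a rough illustration of the MaxClients adjustment mentioned in the comments, a common rule of thumb is to divide the memory you can spare for Apache by the average memory used per httpd process. The sketch below assumes that rule of thumb. The function name is hypothetical, and the numbers in the usage note are placeholders, not measurements from these servers.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, avg_process_mb):
    """Estimate a MaxClients ceiling so Apache alone cannot exhaust RAM.

    total_ram_mb   -- physical memory on the server
    reserved_mb    -- memory set aside for everything else (OS, database, DNS, ...)
    avg_process_mb -- average resident memory of one httpd process
    """
    if avg_process_mb <= 0:
        raise ValueError("avg_process_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    if available_mb <= 0:
        raise ValueError("reserved_mb leaves no memory for Apache")
    # Round down: one process too few is safer than one too many.
    return max(1, int(available_mb // avg_process_mb))
```

For example, with 1024 MB of RAM, 400 MB reserved for other services, and 30 MB per httpd process (all assumed figures), `estimate_max_clients(1024, 400, 30)` returns 20. The real per-process figure should be measured on the server itself, since it can differ a lot between sites.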