6

It seems that every day a website I manage goes offline and back online repeatedly between 12:00 a.m. and 12:25 a.m. I have no idea what is causing the issue, so I am seeking guidance on where to start. It is a WordPress-based site.

So here is what I DO know:

I have a pingdom account which alerts me when the site goes offline so we can see every day, like clockwork, the site goes on/off.

At the time of the ups/downs I see a lot of strain on memory usage. Look at the load average when the site is going online/offline (http://screencast.com/t/BRlfXkqrbJII). I then ran this command to restart httpd (http://screencast.com/t/usVtYWZ2Qi), and the memory usage went down to this (http://screencast.com/t/VdTIy3bgZiQB). An hour after I restarted httpd, the site went offline/online again, so the restart didn't help much.

While the site was going offline/online, I ran the top command and got this (http://screencast.com/t/zEwr7YQj3). Here is the top output when the site is at its lowest load (http://screencast.com/t/eaMfha9lbT - so this would be dubbed "normal").

I have removed all cron scripts that were on my server (backups, etc.). I have also removed every single cron job within my WordPress install. So in theory nothing scheduled is running at all.

Here is a bandwidth report (http://screencast.com/t/AS0h2CH1Gypq).

The traffic doesn't seem to be that much (http://screencast.com/t/s7hrWNNic1K), but given the times at which the site goes up and down, could this be one of the reasons?

I have the dvp Nitro package at Media Temple (http://mediatemple.net/webhosting/nitro/).

So at this point I would request some help in trying to figure out what the cause of this is, and how I can go about pinpointing this issue. ANY HELP is greatly appreciated.

sysadmin1138
Zach Smith
  • 1
    If apache is generating the load, look at your traffic logs to see what kind of traffic is being served. Let us know what it is. – blueben Jan 10 '11 at 17:58
  • Are you sure it's not the Plesk Backup process? That's what's causing high load on our server. – 2ndkauboy Jan 10 '11 at 18:03
  • 1st principles: is this a dedicated box? VPS? Shared hosting? Turn-key wordpress instance? – David Mackintosh Jan 10 '11 at 18:49
  • I have the dvp Nitro package at Media Temple (http://mediatemple.net/webhosting/nitro/). – Zach Smith Jan 10 '11 at 20:08
  • Are other services affected at midnight? Is it a machine-wide issue or just apache stopping and restarting? Is it just your website or others running on the same hosted service? How long is it down for? Is that consistent? – hookenz Jan 11 '11 at 07:39
  • Does the cleaning lady come in at midnight and unplug something to plug the vacuum in? ( ;) ) – carlpett Jan 11 '11 at 09:12
  • i ran a perl script to ping the server every second internally and i found that the http was not affected when the site went offline this morning. – Zach Smith Jan 11 '11 at 12:24
  • Are there any UFO sightings around the same time? Otherwise it could be something like a nightly backup or other regular action that takes place around the same time – mamu Apr 12 '11 at 17:23

6 Answers

3

You need to look at more logs. Check /var/log/messages at around midnight (and perhaps /var/log/messages.0, /var/log/messages.1, etc. for previous nights). Look at your httpd.conf to find where your apache logs are stored (that file should be in /etc/httpd/conf). The ErrorLog directive in that file will tell you where your apache error logging is going (typically an error_log file somewhere). Look at that file to see what it reports around midnight. Check other files in /var/log for unusual activity you can correlate. The log files should tell you why the webserver is failing at midnight.
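As a rough illustration of "look at the logs around midnight", here is a minimal Python sketch that keeps only the lines whose timestamp falls in a window wrapping past midnight. It assumes the classic syslog prefix ("Jan 10 00:03:17 host ..."); the 23:55 to 00:30 window and the function name are my own choices, not something from your server.

```python
import re
from datetime import time

# Classic syslog prefix, e.g. "Jan 10 00:03:17 host kernel: ..." (assumed format).
SYSLOG_TIME = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2}\s+(\d{2}):(\d{2}):(\d{2})\s")


def lines_in_window(lines, start=time(23, 55), end=time(0, 30)):
    """Yield lines whose time of day lies in [start, end].

    If start is later than end, the window is treated as wrapping past midnight.
    Lines without a recognisable timestamp are skipped.
    """
    for line in lines:
        match = SYSLOG_TIME.match(line)
        if not match:
            continue
        stamp = time(*map(int, match.groups()))
        if start <= end:
            inside = start <= stamp <= end
        else:
            inside = stamp >= start or stamp <= end
        if inside:
            yield line
```

You could feed it an open file, for example `lines_in_window(open("/var/log/messages", errors="replace"))`, and print whatever comes back. Apache's error_log uses a different timestamp layout, so the pattern would need adjusting for that file.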

Phil Hollenback
  • I see nothing relevant in the `messages` files. i don't see anything significant in the other files you mentioned. – Zach Smith Jan 10 '11 at 18:22
  • 1
    If possible, try running some sort of script on the local machine that polls the webserver on port 80 via wget or curl at regular intervals. Then if that fails at midnight you know the problem is on the server. If that test doesn't fail, you know the problem is external to the machine. – Phil Hollenback Jan 10 '11 at 18:53
  • so if it is an issue on my server, then what direction do I go? same question with if it's external? i'm lost here... – Zach Smith Jan 10 '11 at 19:01
  • That's the beauty of it - we need to figure out exactly what's going on, before we can decide what action to take. If you can, post sections of the logs on here and it might give a bit more insight on what's going on. – Christian Paredes Jan 10 '11 at 19:36
  • @Phil, isn't my pingdom account which pings the site every minute the same thing as the script you talk about to ping the server? – Zach Smith Jan 10 '11 at 20:11
  • No, because I'm talking about querying the server _from_ the server itself, which rules out whether the failure is on the machine or out on the network. – Phil Hollenback Jan 10 '11 at 20:13
  • @Phil, ty for the clarification. i am trying to find something made already that I can use at the moment. – Zach Smith Jan 10 '11 at 20:14
  • Seems the site went back down again just now. SQL thread monitoring shows it is down to 4 threads per second, which is small (http://screencast.com/t/eKxDy3fR), no memory usage, no large amount of viewers on the site (http://screencast.com/t/gKb8bbLuqy). What am I missing here?? – Zach Smith Jan 10 '11 at 20:47
  • @Phil, the site went offline again this morning at 12:25am, and the perl script shows the server http was working internally. now that i have done this, what does this tell us and where do we go from here? – Zach Smith Jan 11 '11 at 12:24
  • here are the emails of the alerts (http://screencast.com/t/hwGFkp4Hck). – Zach Smith Jan 11 '11 at 13:12
  • What about setting up a cron script on the server that pings, say, google.com and logs the results to a log file? – Christian Paredes Jan 11 '11 at 20:00
3

According to the 'hits per hour' graph that you posted, you get 13,000+ requests in the midnight hour. This is your highest hour by far. When you do a 'service httpd restart' you see a warning message about your MaxClients exceeding your ServerLimit, and Apache lowers your MaxClients to 200. This means that you're allowing up to 200 httpd client processes. Your httpd processes are consuming about 40 MB each. 200 × 40 MB = 8,000 MB, or roughly 8 GB. MySQL is also taking up 300 MB. The OS needs some too. You have no swap configured. Your disk cache is at zero at this time according to the 'top' output that you've posted, but there is a lot of memory free. That's kinda weird and it's throwing me for a loop.
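To make that arithmetic easy to redo with your own numbers, here is a small Python sketch. The 200 clients, 40 MB per process and 300 MB for MySQL come from the figures above; the 1024 MB of RAM and the 200 MB set aside for the OS are purely hypothetical placeholders, since I don't know what your plan provides.

```python
def apache_worst_case_mb(max_clients, per_child_mb):
    """Memory needed if every allowed httpd process is running at once."""
    return max_clients * per_child_mb


def max_clients_that_fit(ram_mb, per_child_mb, reserved_mb):
    """Largest MaxClients that fits in RAM after reserving memory for everything else."""
    return max(0, (ram_mb - reserved_mb) // per_child_mb)


# Figures from the thread: 200 clients at about 40 MB each.
worst_case = apache_worst_case_mb(200, 40)  # 8000 MB

# Hypothetical box: 1024 MB RAM, 300 MB for MySQL plus an assumed 200 MB for the OS.
suggested = max_clients_that_fit(ram_mb=1024, per_child_mb=40, reserved_mb=300 + 200)  # 13
```

Plug in the real RAM figure for your package and the per-process size you actually see in top; the result is only a starting point for MaxClients, not a tuned value.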

Linux might be invoking the OOM killer. Check the dmesg output for signs of that. I'd suggest lowering your MaxClients and/or increasing the amount of RAM (or possibly adding CPU power). You can also look in your apache logs to find out what is hitting your site at this hour. If it is legitimate traffic, then increasing the RAM/CPU is the way to go. If it isn't, then mitigation is the path to take.
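For the "find out what is hitting your site at this hour" part, a sketch along these lines might help. It assumes your access log is in Apache's common or combined format; the function name and the idea of summarising by client address and request line are my own, so treat it as a starting point.

```python
import re
from collections import Counter

# Start of a common/combined log line:
# 1.2.3.4 - - [10/Jan/2011:00:03:17 -0500] "GET / HTTP/1.1" 200 512 ...
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2} [^\]]+\] "([^"]*)"'
)


def summarize(lines, hour=0):
    """Count requests per hour, plus clients and request lines seen during `hour`."""
    per_hour = Counter()
    clients = Counter()
    requests = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        client, hh, request = match.group(1), int(match.group(2)), match.group(3)
        per_hour[hh] += 1
        if hh == hour:
            clients[client] += 1
            requests[request] += 1
    return per_hour, clients, requests
```

Calling `clients.most_common(10)` and `requests.most_common(10)` on the result for the midnight hour should show whether a handful of addresses or URLs account for most of the 13,000+ hits.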

toppledwagon
2

Are you being spidered too aggressively?

Check your Apache logs and try making some adjustments to your robots.txt:

User-agent: BadBot
Disallow: /
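If you want to check what a rule like this does before deploying it, Python's standard urllib.robotparser can evaluate it offline. This is only a sketch: "BadBot" is the placeholder name from the rule above, and "Googlebot" is just an example of a crawler the rule should leave alone.

```python
from urllib.robotparser import RobotFileParser

# The rule from above; "BadBot" is a placeholder user agent.
RULES = """\
User-agent: BadBot
Disallow: /
"""


def is_allowed(user_agent, path, rules=RULES):
    """Return True if `rules` permits `user_agent` to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, path)


blocked = is_allowed("BadBot", "/")      # False: matched by the Disallow rule
allowed = is_allowed("Googlebot", "/")   # True: no rule applies to this agent
```

Keep in mind that robots.txt is advisory, so it only slows down crawlers that choose to honour it.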

Cheers

HTTP500
  • i don't think that is the issue, as if you look at a list of our crawlers we had the LOWEST crawl rate when the site went offline this morning at midnight (http://screencast.com/t/FF6E1hgEJX). – Zach Smith Jan 10 '11 at 18:24
1

May I suggest that you set up cron jobs that perform periodic monitoring during that time? Set up a script that records the CPU usage, memory usage, and so on of your services during that window. You might also want to add a ping to that periodic script so that you can confirm that your server has a working network connection during the outage. The last thing I'd add to that periodic diagnostic script is a wget request to your site during the downtime, across the localhost interface.
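A minimal sketch of such a diagnostic script, in Python rather than wget, might look like the following. It records the load average and the result of one HTTP request; the function names and the output layout are my own invention, and it relies on os.getloadavg(), which is only available on Unix-like systems.

```python
import http.client
import os
import time
import urllib.request


def probe(url, timeout=5.0):
    """Make one HTTP GET and return (ok, detail) without raising on failure."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok, detail = True, f"status={resp.status}"
    except (OSError, http.client.HTTPException) as exc:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        ok, detail = False, f"error={exc!r}"
    return ok, f"{detail} elapsed={time.monotonic() - started:.2f}s"


def sample_line(url):
    """Build one log line: timestamp, load averages, and the probe result."""
    load1, load5, load15 = os.getloadavg()
    ok, detail = probe(url)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} load={load1:.2f},{load5:.2f},{load15:.2f} http_ok={ok} {detail}"
```

Run once a minute from cron, appending `sample_line("http://127.0.0.1/")` to a file would give you a timeline to compare against the Pingdom alerts. Gaps in the timeline, or failures that only show up from outside, are both useful clues.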

It's possible that other systems at your hosting provider may be causing these problems - it may not be your server at all. Hopefully building a script to run server-side can give you additional diagnostic information, and help you to find the cause of the problem.

Is your server virtual? It's possible that your provider performs some kind of snapshotting (from Dom0) at that time, which may freeze the guest domains.

  • thanks. but how would i find out if the provider is performing any snapshots? – Zach Smith Jan 11 '11 at 12:25
  • Well, if you're being snapshotted, it's highly likely that you're going to have your disk activity frozen during the snapshot period, so you're likely to find gaps in your periodic logs. Alternatively, if you wanted to be a little aggressive about your diagnosis, you could set up your periodic script to attempt to consume resources during that 25 minute period, to determine whether your available resources are diminishing during that period. If you were going to move through that approach, might I suggest proving your CPU first, followed by your memory, followed by your network? – David Hagan Jan 11 '11 at 21:29
  • If you want to prove your CPU, using simple tools that are already on your server, you may want to try zipping and re-zipping a directory whose size you know - the time taken should be roughly uniform, which will give you a simple "clock" to check whether you're losing CPU cycles during that period. If you want to check your memory, you might want to create a RAMdisk which should consume the remainder of your free memory, copy in and out of it and remove it, etcetera. If you're hoping to test your network, you might want to network copy from the previously mentioned RAMdisk. – David Hagan Jan 11 '11 at 21:31
  • These may upset your hosting provider, or you may want to be careful about how you do them, in case you're paying for processing cycles or network IO. – David Hagan Jan 11 '11 at 21:33
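The "simple clock" idea from the comments above could be sketched in Python like this: time a fixed amount of compression work and compare the timings across the night. The payload size, the number of rounds and the function name are arbitrary choices of mine, and zlib stands in for zipping a directory.

```python
import time
import zlib

# 1 MiB of deterministic data so every run does the same amount of work.
PAYLOAD = bytes(range(256)) * 4096


def timed_compress(payload=PAYLOAD, rounds=20):
    """Compress `payload` `rounds` times and return the elapsed wall-clock seconds."""
    started = time.perf_counter()
    for _ in range(rounds):
        zlib.compress(payload, 6)
    return time.perf_counter() - started
```

Logging `timed_compress()` every minute or so gives a baseline; a run that takes much longer than its neighbours suggests the process was starved of CPU during that interval. As noted above, be careful with this if you pay for processing cycles.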
0

What time do your logs rotate? If they rotate around midnight, and this is a shared hosting server, then the log rotation itself may cause a lot of load and cause your site to go down.

Here's an option to look at:

i=0
while [ "$i" -lt 1440 ]; do
    # -b is batch mode; -n 1 makes top print one snapshot and exit
    top -b -n 1 >> /tmp/top_file
    sleep 60
    i=$((i + 1))
done

This will run top in batch mode once a minute for an entire day (1,440 samples), appending the output to /tmp/top_file, and give you a bunch of possibly useful information. You need to look at CPU utilization, disk I/O utilization and memory/swap usage.

Also, your hosting package appears to be a VPS. Maybe your VPS doesn't have a problem, but your base OS does? A snapshot style daily backup of the virtual disk image may take 5 minutes?

Devdas
  • re: log rotation: how would i figure this out? re: the command: where does the output go? re: backups: i have removed all backups from Plesk and the server itself until i get this resolved. – Zach Smith Jan 11 '11 at 02:20
  • i have figured out the rotation, but it doesn't seem to run each day at midnight (http://pastebin.com/kLcwDHM6). – Zach Smith Jan 11 '11 at 02:51
0

Hmm... if you don't have any cron scripts or other processes that may cause those reboots, how about asking the manager of the physical host to check if the server is having some hiccups at midnight?

bitwelder