
So I've got a website up and running with nginx/php-fpm/Ubuntu.

It works really well (and fast) and uses hardly any memory. My client started an ad campaign yesterday, and there were a couple times where for five or ten minutes at a time the website wasn't loading. I'm highly doubtful it was traffic overload, since statistics show there weren't very many visitors so far.

During these "outages" I would connect via ssh and run htop to see resource statistics. All of the processors were at around 0%, RAM usage was about 350 MB out of 1024 MB, and no swap was in use.

I looked at the access logs really briefly and didn't see a whole lot there, though I did notice a couple bots poking around. I'm doubtful it's their fault since there's not a whole lot there (On a side note, what's a good way to "consume" simple text log files?)

What are all the steps to debugging this?

Matthew
  • "the website wasn't loading"? On the same LAN where the web server is? Through the internet? On different machines? Different browsers? Did the server answer pings? And access on port 80? – rems Feb 09 '11 at 15:40
  • Sorry I was just talking about a general website - so yes, through the internet, on port 80, with different machines & different browsers, different locations too. – Matthew Feb 09 '11 at 15:54

3 Answers


The first step would be to isolate where the failure is happening. It sounds like you were able to connect to the server during the outage, so it seems unlikely to me that there was a general server failure or a server-local network problem.

The first thing I would do if my web browser was unable to bring up the page would be to establish whether port 80 is responding to connection attempts. The easiest way to do that is to use telnet, e.g. (assuming you're using something Unix-like):

$ telnet your.server.name 80

Try it out with servers you know are working to see what a successful connection looks like. For www.google.com, for example, I get:

 $ telnet www.google.com 80
 Trying 74.125.95.103...
 Connected to www.l.google.com.
 Escape character is '^]'.

(To exit from telnet in this state, hit Ctrl-] to get the telnet> prompt, then type quit and press Enter.)

Failures you might see include DNS failure:

$ telnet fake.dns.entry 80
telnet: could not resolve fake.dns.entry/80: Name or service not known

In which case you would follow up by trying to connect to the IP address.

Another failure possibility is a refused or timed-out connection:

$ telnet serverfault.com 99
Trying 64.34.119.12...
telnet: Unable to connect to remote host: Connection timed out

A timeout like this generally means a firewall or load balancer in between you and the server is silently dropping the packets, while "Connection refused" means the host is reachable but nothing is listening on that port. You might also see:

$ telnet 192.168.0.237
Trying 192.168.0.237...
telnet: Unable to connect to remote host: No route to host

Which means the server doesn't exist at the address you thought it did, or there's a network routing problem in between.

You should first test this from outside the network the server is on, preferably from somewhere several ISPs removed. Then try it from the local network. Then try it from the local machine, using "localhost" in place of the hostname, assuming your web server is set to listen for loopback connections.
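If you want to repeat the same check from several vantage points without reading telnet output by eye, here is a minimal Python sketch of it. The function name check_port and the result labels are made up for illustration; it only reports how the TCP connection attempt ended and does not send an HTTP request.

```python
import errno
import socket


def check_port(host, port=80, timeout=5.0):
    """Try one TCP connection and name the failure mode, like the telnet tests."""
    try:
        # create_connection resolves the name, then attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        return "dns-failure"   # name did not resolve; retry with the IP address
    except (socket.timeout, TimeoutError):
        return "timed-out"     # no reply at all, often a firewall dropping packets
    except ConnectionRefusedError:
        return "refused"       # host reachable, but nothing listening on that port
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "no-route"  # routing problem, or no host at that address
        return "error: %s" % exc
```

Calling check_port("your.server.name", 80) from an outside machine, from the local network, and finally check_port("localhost", 80) on the server itself gives the same pattern of results described above.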

Once you know the pattern of the failures, then you can start trying to figure out where the failure is happening. My gut instinct is that your nginx or FastCGI is the root of the problem rather than some intermittent network problem that doesn't affect SSH traffic, but it's not really possible to troubleshoot further without first addressing the network question.

Hope this gives you some ideas of what to start with next time. Good luck.

Update

I just noticed your side question about the best way to "consume" log files. If you are in the middle of troubleshooting the problem, I recommend using tail. Open up two ssh sessions on the server, and in one run tail -f /var/log/nginx/access.log and in the other tail -f /var/log/nginx/error.log (those are the usual paths for the Ubuntu nginx package; use whatever the paths are on your system).

If you need to dig through a dense log file after the fact, a good tool to start with is less. Just run less /var/log/nginx/error.log, then press space to page down, b to page up, and / to start a search, after which n finds the next match and N the previous one. Use q to exit back to the shell.

I would guess there are better tools specific to particular types of logs, but tail and less usually get me about 90% of what I need when troubleshooting my logs.
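When tail and less are not enough, a few lines of script can condense an access log into something you can scan for quiet minutes or bursts of errors. This is only a sketch: it assumes nginx's default "combined" log format, and the name summarize is made up for illustration.

```python
import re
from collections import Counter

# Matches the timestamp (to the minute) and the status code in nginx's default
# "combined" format; a custom log_format would need a different pattern.
LINE_RE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [^\]]*\] "[^"]*" (\d{3}) ')


def summarize(lines):
    """Count requests per (minute, status) so gaps and error bursts stand out."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Pass it an open file, e.g. summarize(open("/var/log/nginx/access.log")), and print the items; they come out in the order the minutes first appear in the log, so a missing stretch of minutes during an outage is easy to spot.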

daveadams

You should test from IP addresses external to your location, such as proxies. You can try using the Tor network for this kind of testing. The first thing is to check whether the site is accessible from various places on the Internet. Possibly DNS records were changed recently and haven't propagated yet.

Alex
  • Hmm. My client lives on the other side of the country and gets the same thing - it won't load. DNS records were changed almost 2 weeks ago. – Matthew Feb 09 '11 at 15:53
  • Oh I see, you experienced these outages too, I just misread your question. Okay, what was an error? Just a timeout? What was in nginx access and error logs? – Alex Feb 09 '11 at 15:56

You've not provided any information about how the server is configured or where it's hosted. There are all sorts of things which might be affecting this, e.g. network connection problems or CPU contention issues on a virtual machine.

I assume you've got error logging configured correctly and have checked that there was no change in the pattern of errors during these outages.

There's probably not a lot you can do to analyse what happened in the previous event - but do look to see if there has been a variation in response times.

Going forward, you might consider setting up iptables to log the start of every TCP handshake on port 80, and start writing the request duration to the access logs (%D in Apache's log format; the nginx equivalent is $request_time). Then look for slow responses or gaps between SYN packets and completed responses.

If the system shows a consistent delay between the SYN packet and the response, then the problem is not with the software running on the machine.
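Once response times are in the access log, picking out the slow requests is straightforward. The sketch below assumes a custom nginx log_format with $request_time (seconds, millisecond resolution) appended as the last field on each line; that layout and the name slow_requests are assumptions for illustration, not something nginx does by default.

```python
def slow_requests(lines, threshold=1.0):
    """Yield (seconds, line) for requests whose trailing request time exceeds threshold."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            # Assumption: the last whitespace-separated field is $request_time.
            seconds = float(fields[-1])
        except ValueError:
            continue  # line does not end in a number; skip it
        if seconds > threshold:
            yield seconds, line.rstrip("\n")
```

Comparing the timestamps of these slow entries with the iptables SYN log should show whether the delay happens before the request reaches nginx or while it is being served.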

Running external (HTTP) and internal (just a daemon which writes something to a log file, then sleeps for a short interval) heartbeats against the server might be a good idea too. Again, if you see issues on the external heartbeat but not the internal one, it points to a network problem; if you see gaps in both, then there's a problem with the hardware of the server itself.
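The internal heartbeat can be very small. Here is a rough sketch, assuming the beat is written once per fixed interval (from a loop or cron); the names write_heartbeat and find_gaps and the tolerance factor are made up for illustration.

```python
import time


def write_heartbeat(path):
    """Append the current Unix time to the log; call this once per interval."""
    with open(path, "a") as fh:
        fh.write("%.3f\n" % time.time())


def find_gaps(timestamps, interval, tolerance=2.0):
    """Return (start, end) pairs where consecutive beats are more than
    interval * tolerance seconds apart."""
    gaps = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        if later - earlier > interval * tolerance:
            gaps.append((earlier, later))
    return gaps
```

The same find_gaps check works on the timestamps recorded by an external HTTP probe, so the two lists of gaps can be compared directly.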

Consider adding a client-side performance agent such as boomerang to log page response times too.

symcbean