I came into work to a problem I haven't run into before, and was curious to see if anyone here might have any ideas about what may have caused it.

We run a VPS from Slicehost, and sometime yesterday the sites we host went down, after working properly for the several months since I last changed anything on the server. Only the sites served over HTTP (using port 8080) were affected. The HTTPS sites (standard port) still worked if they were accessed explicitly via https://site.com (as opposed to entering site.com and letting the redirects do the work), and SSH connections directly to the server also worked.

It stayed like this until this morning. I rebooted the server, but that didn't help. I SSHed into it and made sure everything was up and running properly. There were no error messages out of the ordinary in the Nginx logs or other logs that I checked. Still, nothing changed. Then all of a sudden, about half an hour after I did that, while I was searching to find the cause, the sites started working again.

I never did find out what caused the issue (everything I found dealt with client-side problems), so I'm curious what could have caused it. That way, I can better diagnose and fix it if something like this happens again.

Shauna
  • Perhaps a vendor-related issue which they resolved during your effort to determine a root cause? I have had vendors resolve issues before admitting they did anything. – Chris Jul 05 '11 at 13:48
  • @Chris - That's what I've been leaning toward, but I was hoping someone might have some suggestions for what might cause it on my end, or at least confirm that it was a vendor issue. – Shauna Jul 05 '11 at 14:03

1 Answer

Practically anything could have caused the problem. Unless someone happens to have hit this exact problem, with the same cause, you probably won't get a definitive answer to the question you pose.

However, to help you get to the cause of the problem next time, here are some diagnostic tips:

  • First, is the network traffic actually getting to the server? Run tcpdump -i ethN -n port 8080 and try to make the request. If tcpdump shows nothing, it's a network problem. Hassle Slicehost.
  • If the traffic does get through, run iptables -L INPUT -v >/tmp/before, hit the site, run iptables -L INPUT -v >/tmp/after, and then diff /tmp/before /tmp/after. Any differences in packet/byte counts indicate a possible firewall rule that's blocking the traffic. You'll need to verify each rule to actually determine whether it's the cause of the problem or not. (This is why it's a good idea to log your firewall blocks; makes this sort of thing much easier).
  • Run netstat -ltnp |grep :8080 to verify that nginx is, in fact, listening on the port of interest, and that it's listening on the correct IP. Don't take anything for granted at this stage of the game.
  • If there's no firewall rule blocking the traffic and the process you think should be listening is doing so, then strace the nginx processes (strace -p <pid> -p <pid> for all associated with nginx) to check whether they're getting the traffic, and to see whether (and what) they're doing about it.
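
As a complement to the server-side checks above, a quick probe from a client machine can hint at where the traffic is being lost. The following is only a minimal sketch in Python: the function name probe_port and the three-way classification are my own illustration, not part of any tool mentioned above. Roughly, "open" means something accepted the connection, "refused" usually means the packet reached a host (or a firewall that rejects actively) but nothing accepted it, and "timeout" suggests packets are being silently dropped somewhere along the way.

```python
import socket
import sys


def probe_port(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    (Hypothetical helper for illustration only.)
    """
    try:
        # create_connection resolves the host and tries each address in turn.
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        # No reply at all within the timeout: packets are likely dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The connection was actively rejected: no listener, or a REJECT rule.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. DNS failure or "no route to host".
        return "error: %s" % exc
    conn.close()
    return "open"


if __name__ == "__main__":
    # Usage: python probe.py yourserver.example
    target = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    # 8080 and 443 are the ports from the question; 22 is assumed for SSH.
    for port in (8080, 443, 22):
        print(port, probe_port(target, port))
```

If 443 and 22 report "open" while 8080 reports "timeout", that points toward filtering in front of the server, which is the case where the tcpdump check is most informative. A "refused" on 8080 points back at the listener itself, so the netstat check is the next stop.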
womble
  • It mysteriously came back up, which suggests that it may have been a hosting issue. Thanks for the tips, though. – Shauna Jul 29 '11 at 13:08