
We have some timeouts that are driving me crazy, under virtually no load (probably a couple people hitting the servers per minute).

We use nginx to redirect non-SSL to SSL, terminate the SSL, and then reverse proxy the request to haproxy which sends it to one of our app servers.
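Roughly, the first layer is set up along these lines (hostnames, certificate paths, and values here are illustrative rather than our literal config; the 127.0.0.1:82 upstream matches the haproxy listener in the error below):

    # front-end nginx: redirect plain HTTP to HTTPS, terminate SSL, proxy to local haproxy
    server {
        listen 80;
        server_name domain.com;
        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl;
        server_name domain.com;
        ssl_certificate     /etc/nginx/ssl/domain.com.crt;   # illustrative path
        ssl_certificate_key /etc/nginx/ssl/domain.com.key;   # illustrative path

        location / {
            proxy_pass http://127.0.0.1:82;                  # haproxy listening locally
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }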

Our app servers run passenger (rails) + nginx. We also have a mysql master + slave and a memcached instance, which we recently started using for some queries.

Here is a typical error I see in the error log of the first-layer nginx (the one that passes the requests to haproxy), with details obfuscated:

2012/02/25 06:42:15 [error] 7838#0: *60797 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 1.2.3.4, server: domain.com, request: "GET /api/v1/some_route HTTP/1.1", upstream: "http://127.0.0.1:82/api/v1/some_route", host: "domain.com"

I am not sure if it's haproxy, passenger+nginx, rails, or memcached. One empirical data point is that they seem to happen in bunches, i.e. if we get one timeout we see several others, and then they go away.

Any help would be greatly appreciated. Happy to post any configs or anything that would help.

VNO
  • One method of troubleshooting these annoying problems is to work backwards through the system: start with a simple setup that you know works and add parts until you see the error. In this case I would start by removing the memcache. – Tom Feb 28 '12 at 14:56
  • Also, if you only write to the master mysql and use the slave just as a warm standby, then that is (almost) definitely not the issue. And if you have logs of the timed-out requests from this side of any SSL handling, then you can rule that out as well. – Tom Feb 28 '12 at 15:00
  • I did what you suggested here last night: I took out the load balancer entirely and had the clients hit the servers directly, without SSL. There seem to be no timeouts occurring, so I guess it's either the load balancer or the SSL configuration. There is nothing unusual about my load balancer :( – VNO Feb 28 '12 at 18:10
  • Where do you add SSL into the mix? Hasn't your nginx already seen the SSL request in the log entry you provided in the question, or is SSL used to tunnel some middle part of the request internally? – Tom Feb 28 '12 at 19:09
  • Nginx redirects non-SSL to SSL, does the termination, and passes the request to haproxy. Or previously it did; I then switched to just having nginx do the load balancing and we had the same problem. – VNO Feb 28 '12 at 23:46
  • I guess it's hard to understand your setup. Am I right in saying that a request comes in from a customer like `http://myserver.com/index.html`, you send a 302 redirect back to the client, and then they come back with `https://myserver.com/index.html`? Or do you somehow make an SSL proxy request internally? – Tom Feb 28 '12 at 23:53
  • Yes, exactly on the first one. We are just basically redirecting all non-SSL traffic to SSL. However, the clients hitting the API are hitting the https endpoint directly so the redirection is not the problem. – VNO Feb 29 '12 at 17:53

2 Answers


(It's probably worth mentioning that I am not an nginx user, or indeed a rails one, so these are just initial guesses to maybe kick off the thread with some ideas for an answer.)

From the details of your log entry, it seems that the external request is being forwarded internally by nginx on the server with Host string "domain.com" onto a local haproxy running on localhost:82?

If that is the case, then I would really look to correlate the log entries from nginx with haproxy, i.e. find that same request in the haproxy log.
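Something as simple as this would probably do it (I'm assuming haproxy is logging via syslog to a file such as /var/log/haproxy.log; adjust the path and timestamp to match your setup):

    # look for the request from the nginx error entry around the same time
    grep "06:42" /var/log/haproxy.log | grep "/api/v1/some_route"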

Given that I don't know nginx, I am guessing here, but I think you need to determine whether this 110 message corresponds to a proxy_connect_timeout or a proxy_read_timeout. The former means that nginx didn't get any response from haproxy at all (host A sends SYN, your localhost:82 dropped the packet); the latter means it connected but no data came back (SYN, SYN-ACK, ACK, but no data on the stream).
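If I read the nginx docs right, those two directives sit alongside the proxy_pass, something like this (the values here are placeholders, not a recommendation):

    location / {
        proxy_pass            http://127.0.0.1:82;
        proxy_connect_timeout 5s;     # how long nginx waits to establish the TCP connection to haproxy
        proxy_read_timeout    60s;    # how long nginx waits for data back once connected
    }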

If the latter, it's likely the problem is further back in your web stack, and you should look for a corresponding entry in the memcache or mysql logs.

For example, enable the slow query log in your my.cnf configuration on mysql and see whether there is an entry corresponding to your request in that log file. I think the default location is /var/lib/mysql/slow.log, but there might be some customization.
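Something along these lines in my.cnf turns it on (the directive names are the MySQL 5.1+ ones; older versions use log_slow_queries instead, and the paths are just the common defaults):

    [mysqld]
    slow_query_log      = 1
    slow_query_log_file = /var/lib/mysql/slow.log
    long_query_time     = 1    # seconds; lower it to catch more of the queries behind your requests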

More generally, on platforms where you have created a fairly complicated system, it is helpful to have some centralized logging infrastructure for event correlation. I am currently deploying logstash for such purposes, and there are obviously Splunk and logblaze as commercial alternatives.

Tom
  • Thanks for helping out, Tom. I took out the load balancer and some timeouts were still reported. I'm wondering if it's memcached? We have newrelic pro monitoring on and the weird thing is they are not catching these. – VNO Feb 27 '12 at 19:59
  • Well, you can grep through the log file for similar errors and see whether they are all objects that would be cached, or whether there is anything else they have in common. I guess my remark would be that at the moment you don't have good visibility here, so it's hard to make any progress. – Tom Feb 28 '12 at 14:54

I had a problem where an HTTP response was only partially getting back to my browser. The problem was the automatic caching of nginx. I had installed nginx into a special directory. I found that if I added the following lines

in the http block:

    proxy_cache_path /var/lib/nginx/proxy levels=1:2 keys_zone=my-cache:8m max_size=1000m inactive=600m;
    proxy_temp_path /var/cache/tmp;

and in the location block:

    proxy_cache my-cache;
    proxy_cache_valid 200 302 60m;
    proxy_cache_valid 404 1m;

and changed the permissions on the tmp and proxy directories, then the whole HTTP response got sent to my browser.
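(The permission change was along these lines, assuming the nginx workers run as the www-data user; substitute whatever user your nginx actually runs as:)

    # make the cache and temp directories writable by the nginx worker user (www-data is an assumption)
    chown -R www-data:www-data /var/lib/nginx/proxy /var/cache/tmp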

Andrew