
Running nginx 1.0.15 on CentOS 6.5. I have three upstream servers and everything works fine; however, when I simulate an outage and take one of the upstream servers down, I notice considerable lag in response times (an additional 5-7 seconds). The second I bring the downed server back online, the lag disappears. Another odd thing I noticed: if I simply stop the httpd service on the simulated-outage server, response times are normal; the lag only occurs if the server is completely down.

Here is my conf:

upstream prod_example_com {
    server app-a-1:51000;
    server app-a-2:51000;
    server app-a-3:51000;
}


server {

    # link: http://wiki.nginx.org/HttpCoreModule#server_name
    server_name example.com www.example.com *.example.com;

    #-----
    # Upstream logic
    #-----


    set $upstream_type prod_example_com;


    #-----

    include include.d/common.conf;

    # Configure logging
    access_log  /var/log/nginx/example/access/access.log access;
    error_log   /var/log/nginx/example/error.log error;

    location / {

        # link: http://wiki.nginx.org/HttpProxyModule#proxy_pass
        proxy_pass  http://$upstream_type$request_uri;

        # link: http://wiki.nginx.org/HttpProxyModule#proxy_set_header
        proxy_set_header    Host    $host;
        proxy_set_header    X-Real-IP   $remote_addr;
        proxy_set_header    X-Forwarded-For     $proxy_add_x_forwarded_for;
    }

    location ~* \.(js|css|png|jpg|jpeg|gif|ico)$ {

        # link: http://wiki.nginx.org/HttpProxyModule#proxy_pass
        proxy_pass  http://$upstream_type$request_uri;

        # link: http://wiki.nginx.org/HttpProxyModule#proxy_set_header
        proxy_set_header    Host    $host;
        proxy_set_header    X-Real-IP   $remote_addr;
        proxy_set_header    X-Forwarded-For     $proxy_add_x_forwarded_for;

        proxy_hide_header expires;
        proxy_hide_header Cache-Control;

        # Even though this reads like the older syntax, nginx handles it internally, setting max-age to now + 1 year
        expires max;

        # Allow intermediary caches to cache the asset
        add_header Cache-Control "public";
    }
}

I have tried the suggestions on similar posts like this. Apparently my version of nginx is too old to support health_check as outlined in the nginx docs. I've also tried explicitly setting max_fails=2 and fail_timeout=120 on the app-a-3 upstream definition, but none of this avoids the additional 5-7 second lag on every request when app-a-3 is offline.
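
For reference, here is roughly what the upstream block looked like with those passive-check parameters applied (a sketch; only app-a-3 had the extra parameters):

upstream prod_example_com {
    server app-a-1:51000;
    server app-a-2:51000;
    # passive checks only: after 2 failed attempts, skip this peer for 120s
    server app-a-3:51000 max_fails=2 fail_timeout=120;
}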

-- Update --

Per request, here is the output for a single request when app-a-3 is completely down. The only thing I could see out of the ordinary is the 3-second lag between the initial event and the subsequent event.

-- Update #2 --

Looks like a few years ago Nginx decided to create Nginx Plus, which adds active health checks, but only with a yearly support contract. Based on some articles I've read, Nginx got tired of making companies millions and getting nothing in return.

As mentioned in the comments, we are bootstrapping and don't have the $$ to throw at a $1,350/year contract. I did find this repo, which provides the functionality. Wondering if anyone has any experience with it? Stable? Performant?
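
For context, the module's README configures the active check inside the upstream block, roughly like this (untested on my end; the directive names come from the upstream-check module's docs and the values are just illustrative):

upstream prod_example_com {
    server app-a-1:51000;
    server app-a-2:51000;
    server app-a-3:51000;
    # poll each peer every 3s; mark it down after 5 failures, back up after 2 successes
    check interval=3000 rise=2 fall=5 timeout=1000 type=http;
    check_http_send "HEAD / HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}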

Worst case, I will just have to bite the bullet and pay the extra $20/month for a Linode "NodeBalancer", which I am pretty sure is based on Nginx Plus. The only problem is that there is no control over the config other than a few generic options, so there is no way to support multiple vhost files via one balancer, and all the nodes have to be in the same datacenter.

-- Update #3 --

Here are some siege results. It seems the second node is misconfigured, as it is only able to handle about 75% of the requests the first and third nodes handle. I also thought it odd that when I took the second node offline, performance was as bad as when I took the third (better-performing) node offline. Logic would dictate that if I removed the weak link (the second node), I would get better performance, because the remaining two nodes individually outperform the weak link.

In short:

node 1, 2, 3 + my nginx = 2037 requests

node 1, 2 + my nginx  = 733 requests

node 1, 3 + my nginx = 639 requests (huh? these two perform better individually, so together they should be somewhere around ~1,500 requests, given ~2,000 requests when all three nodes are up)

node 1, 3 + Linode Load Balancer = 790 requests

node 1, 2, 3 + Linode Load Balancer = 1,988 requests
Mike Purcell
  • Can you adjust the error_log to debug and post the log entries when you shut down app-a-3? – gtirloni Aug 29 '14 at 20:19
  • Yes, I will post ASAP. – Mike Purcell Aug 29 '14 at 20:45
  • It's interesting to know that nginx spent 3 seconds (see lines 71-85) to realize that the upstream server is offline and then spent 3 seconds (again) (see lines 95-118) to retrieve content from another upstream. – masegaloeh Aug 30 '14 at 16:09
  • That's what I am struggling with. It's my impression that nginx has a black-box algorithm to determine if an upstream server is down; if so, it does not send another request to it for X number of seconds. – Mike Purcell Aug 31 '14 at 01:13
  • Would you consider simply updating nginx? It sounds like it's not behaving as it should, and many performance improvements have been made since. – Grumpy Sep 01 '14 at 23:23
  • @Grumpy: It's not a question of unusual behavior or performance; rather, it's a question of handling pool outages. The newer versions of nginx force you to use Nginx Plus if you want active health checks. – Mike Purcell Sep 02 '14 at 01:32
  • If you can't afford a "professional load balancer", go cloud. – Giovanni Toraldo Sep 02 '14 at 19:24
  • nginx 1.0 is too old, really. It's not worth looking for load-balancing bugs in it (which it certainly has, many). You may consider going with the latest version and something like this: https://github.com/yaoweibin/nginx_upstream_check_module – gtirloni Sep 03 '14 at 17:52
  • @gtirloni, I have actually tried that mod. I read the notes and it said it supports up to 1.2.9, plus 1.5.x and 1.7.x; the problem is that the 1.5 and 1.7 versions are "mainline", meaning they aren't considered stable. So the latest stable release would be 1.2.9. I have this version running in my lab, but haven't had much time to test it. – Mike Purcell Sep 03 '14 at 22:32

3 Answers


If nginx sends a request to a closed port on a server with a functional IP stack, it'll get an immediate negative acknowledgement (a TCP RST). If there's no server there to respond (or if you drop the incoming packet at a firewall), then you'll have to wait for the connection to time out.
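
You can see the difference directly on one of the upstream hosts. With iptables, for example, REJECT behaves like your stopped-httpd case (immediate refusal), while DROP behaves like the host being completely gone (nginx hangs until the connect timeout). Port 51000 here just matches your config:

# Connection is refused immediately with a TCP RST (fast failure, like a stopped httpd)
iptables -I INPUT -p tcp --dport 51000 -j REJECT --reject-with tcp-reset

# Packets are silently discarded; nginx waits for the connect timeout to expire
iptables -I INPUT -p tcp --dport 51000 -j DROP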

Most load balancers have a polling mechanism and/or heartbeat for preemptively checking for a down server. You might want to look into those options. Polling isn't usually run against a web server more than once or twice a minute, but a heartbeat check for server down situations might be every second or so.

Nginx is not the most sophisticated of load balancers. If you're getting into this sort of issue you might want to look at other options.

EDIT: Something like this maybe? http://www.howtoforge.com/setting-up-a-high-availability-load-balancer-with-haproxy-heartbeat-on-debian-lenny. For a smallish installation, there's no need for separate servers; just put it on the web server boxes. That gives you load balancing, but not caching. There are also HA solutions that control squid or varnish in response to a heartbeat.

mc0e
  • As we are bootstrapping, we really can't afford a commercial-level load balancer like an F5 at this point, maybe after the first round closes. What I don't get is why the passive checks like the `max_fails` and `fail_timeout` directives don't seem to work. According to the nginx docs (http://nginx.org/en/docs/http/ngx_http_upstream_module.html#health_check) the active health check is available, but only with a "commercial subscription" at $1,350 a year. Surely someone else has encountered this scenario. – Mike Purcell Aug 31 '14 at 17:06
  • @MikePurcell Edited my answer to include a likely alternative software load balancer. – mc0e Sep 01 '14 at 10:20
  • If you only have 3 upstream servers, a hardware load balancer would be overkill even if the budget was there. Check out HAProxy. – mc0e Sep 03 '14 at 04:41
  • Obviously I am looking to scale, so there could be many down the road. HAProxy really isn't the route I want to take, as I prefer nginx since it can properly handle multiple vhosts and redirect to various clusters. – Mike Purcell Sep 03 '14 at 05:26
  • Intuitively I'd expect HAProxy to scale better, but look for results from someone who has actually tested it. Whether you use nginx or HAProxy is up to you, but the important part of what I've said is that you need heartbeat or something equivalent. – mc0e Sep 03 '14 at 08:19
  • How is heartbeat going to help my situation? Heartbeat (or keepalived) would be great if I wanted to establish fault tolerance for my nginx load balancer: single IP points to LB1, if LB1 is not reachable the IP automatically switches over to LB2. I set this up for my master-master MySQL replication for single-master writes; I don't see how it's relevant to my problem. However, it does seem I should start reading up on HAProxy, although I have to believe, with all the nginx users out there, someone has encountered this same issue. – Mike Purcell Sep 03 '14 at 16:04
  • How are you currently moving the IP between your load balancers? – mc0e Sep 05 '14 at 09:52
  • I never mentioned anything about having multiple load balancers, nor was it part of the OP. I wasn't asking how to implement fault tolerance at the load balancer level; I was asking why nginx fails to disregard an offline node in an upstream pool. – Mike Purcell Sep 05 '14 at 14:50
  • @Mike Purcell You said "single IP points to LB1, if LB1 is not reachable the IP automatically switches over to LB2". I was wondering how you were doing that. That's the sort of thing you'd usually use heartbeat for. – mc0e Sep 06 '14 at 03:21

A couple of things you can try:

  1. Update to the latest version of nginx from the official repos: http://nginx.org/en/linux_packages.html#stable
  2. Try reducing the proxy_connect_timeout setting; set it to something really low for testing, say 1 second (see the sketch below): http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_connect_timeout
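
A minimal sketch of option 2 against the location block from your question (1s is deliberately aggressive and only for testing):

location / {
    proxy_pass http://$upstream_type$request_uri;

    # Give up quickly if a peer doesn't complete the TCP handshake...
    proxy_connect_timeout 1s;
    # ...and retry the request on the next server in the upstream group
    proxy_next_upstream error timeout;
}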
Rwky
  • We had our `proxy_connect_timeout` set absurdly high, and I thought this was why NGINX was taking 2min+ to start. Adjusting this down to just `2s` did not make a difference in startup time when some upstream(s) were down. Not sure what I'm missing here. NGINX should start fast even if some upstream is down. – Josh M. Feb 18 '23 at 20:26

Over the last few weeks I have been working with the NGINX pre-sales engineering team, trying to resolve the issue before I purchase the support contract. After a lot of tinkering and collaboration, the only explanation we could come up with for the increased lag when a single node goes completely dark is that the servers in question were all running the much older Apache 2.2.

The NGINX engineers were not able to recreate the issue using Apache 2.4.x, so that would be my suggested fix if anyone else encounters the same situation. However, for our project, I am working on shifting away from Apache altogether, and implementing NGINX with php-fpm.

In closing, our environment will use NGINX Plus (which requires the support contract) as the load balancer, due to its ability to issue active health checks to upstream nodes, distributing requests round-robin to upstream nodes running NGINX (FOSS).
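
For anyone curious what the Plus-only active check looks like, it ends up being roughly this (a sketch based on the NGINX Plus docs; the interval/fails/passes values are just examples):

upstream prod_example_com {
    zone prod_example_com 64k;   # shared memory zone, required for active health checks
    server app-a-1:51000;
    server app-a-2:51000;
    server app-a-3:51000;
}

server {
    location / {
        proxy_pass http://prod_example_com;
        # NGINX Plus only: probe each peer every 5s, mark down after 3 failures, up after 2 passes
        health_check interval=5 fails=3 passes=2;
    }
}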

Mike Purcell