Slow responses from load-balanced web servers when one server fails

Question

I have a network with one load balancer, lb1 (running nginx), which distributes traffic across four web servers: web1, web2, web3 and web4. nginx balances across the four using round-robin.

All four servers are set to max_fails=1 and fail_timeout=5s, so a server that goes down should be taken out of rotation fairly quickly.
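
For reference, the setup described above would presumably correspond to something like the following. This is only a sketch: the actual configuration file was not posted, and the upstream name backend and the listen port are assumptions made for illustration.

```nginx
# Sketch of the configuration described in the question (not the poster's actual file).
# "backend" is a hypothetical upstream name; web1..web4 are the hosts named above.
# Both blocks belong inside the http context.
upstream backend {
    # No balancing method is specified, so nginx uses its default: round-robin.
    server web1 max_fails=1 fail_timeout=5s;
    server web2 max_fails=1 fail_timeout=5s;
    server web3 max_fails=1 fail_timeout=5s;
    server web4 max_fails=1 fail_timeout=5s;
}

server {
    listen 80;  # assumed port

    location / {
        # Forward every request to one of the four web servers.
        proxy_pass http://backend;
    }
}
```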

I should note that the average response time from each web server is around 50-150ms when all four web servers are online. The issue arises when just ONE web server is offline. When one goes offline and a user tries to load another page, the response time varies anywhere from 50ms to 25s. Yes, 25 seconds.

I am confused, because I would expect the round-robin and fail_timeout settings to cause the offline server to be ignored.

Additional, possibly relevant notes: all four web servers are running Apache with PHP 5, and memcached is enabled across the four.

Are the web servers communicating in any way? Do they share any resources? Where is memcached sitting in this architecture? – Tim, Aug 12 '16 at 17:53
During this time, does nginx say that it has disabled the failed host? Do you get the same slow response rates when querying the web servers from behind the load balancer? – Ryan Babchishin, Aug 12 '16 at 18:13
@Tim The web servers all connect to the same remote MySQL database. They share all the data found there, both reading and writing. memcached is installed on each web server and sends data between the servers to keep sessions persistent even if the user gets routed to a different server by the load balancer in the future. – Tyler Hanavan, Aug 12 '16 at 18:25
@RyanBabchishin When I use the public IP to connect directly to the web servers, I get the quick response time (50ms-150ms). I'm not entirely sure how to see what nginx says about the failed host, but I do get this message when one of the web servers is down and I try to connect: [error] 23311#23311: *628 upstream timed out (110: Connection timed out) while connecting to upstream – Tyler Hanavan, Aug 12 '16 at 18:35

VBart · Accepted Answer · 2016-08-12T18:39:45.927

It seems you misunderstand the fail_timeout parameter. Please re-read the documentation, which describes it as:

"the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable;"

The parameter doesn't limit the duration of each attempt. It defines the period within which max_fails unsuccessful attempts must occur for nginx to consider the server down and stop directing requests to it.

You should tune the proxy_connect_timeout, proxy_send_timeout and proxy_read_timeout directives, and increase the fail_timeout value.
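
A minimal sketch of what that tuning might look like follows. The numbers are illustrative assumptions rather than recommendations, and the upstream name backend is hypothetical. Two facts from the nginx documentation motivate the changes: proxy_connect_timeout defaults to 60 seconds, so a single attempt against a dead host can block for a long time, and fail_timeout is also the period for which a failed server is considered unavailable, so a short value makes nginx retry the dead host more often.

```nginx
# Illustrative values only; adjust them to the real response times of the backends.
# "backend" is a hypothetical upstream name. Both blocks belong inside the http context.
upstream backend {
    # A longer fail_timeout keeps a failed server out of rotation for longer
    # before nginx tries it again.
    server web1 max_fails=1 fail_timeout=30s;
    server web2 max_fails=1 fail_timeout=30s;
    server web3 max_fails=1 fail_timeout=30s;
    server web4 max_fails=1 fail_timeout=30s;
}

server {
    listen 80;  # assumed port

    location / {
        proxy_pass http://backend;

        # Give up quickly when a backend does not accept the connection.
        proxy_connect_timeout 2s;

        # Limits on the time between two successive write / read operations
        # on the upstream connection (not on the whole request).
        proxy_send_timeout 10s;
        proxy_read_timeout 10s;
    }
}
```

With settings along these lines, a request that lands on the dead server should wait at most the connect timeout before nginx moves on to the next server in the group, assuming the default proxy_next_upstream behaviour (error timeout) is left in place.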

edited Aug 12 '16 at 18:39

answered Aug 12 '16 at 18:34

VBart


This works perfectly. Thank you, you've saved me a lot of trouble. – Tyler Hanavan Aug 12 '16 at 18:51
