I have a Varnish HTTP cache running in front of 40 workers. These workers run on two identical Docker nodes (`node-1` and `node-2`, 20 on each) as a single service (`web_workers`). We're using Docker Swarm with the default `endpoint_mode`, i.e. round-robin load balancing, so Varnish forwards requests to a single backend hostname (which, as far as I understand, resolves to the Docker Swarm virtual IP).
Not long ago, I noticed that the nodes were receiving an unequal share of requests: `node-1` processes four times more requests than `node-2`.
Varnish uses long-lived, persistent TCP connections to the backend, and `node-2` was started after `node-1`. I suspect the imbalance can be explained by `node-1` having been launched first: Docker Swarm's naive round-robin load balancing allocated the persistent TCP connections to it before `node-2` started, and because Varnish keeps reusing those long-lived connections, the imbalance persists.
- How could I confirm this theory, if it makes sense?
- What are the possible workarounds?
I'm thinking of disabling TCP connection reuse on the Varnish side, but that could result in a severe performance hit (to be tested). Another option is to not use a single Docker service but one per physical node. I could also restart the Varnish instances whenever the Docker service is updated, but this is a sensitive piece of infrastructure that I'd rather not restart. Any idea is welcome!
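To illustrate the first idea: as far as I know, connection reuse can be disabled from VCL by asking the workers to close the connection after every response, so no pool of long-lived connections stays pinned to whichever node came up first. This is an untested sketch, and it means a fresh TCP handshake for every backend fetch:

sub vcl_backend_fetch {
    # Ask the worker to close the connection after each response,
    # preventing Varnish from keeping and reusing it.
    set bereq.http.Connection = "close";
}

A softer variant might be lowering the varnishd `backend_idle_timeout` parameter, so idle backend connections expire quickly and get redistributed by the Swarm VIP at a much smaller cost than closing after every fetch.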
EDIT:
Backend configuration part of the VCL file:
import directors;  # required for directors.round_robin()

backend origin {
    .between_bytes_timeout = 5s;
    .connect_timeout       = 1s;
    .first_byte_timeout    = 5s;
    .host                  = "web_origin";
    .host_header           = "web_origin";
    .max_connections       = 200;
    .port                  = "8080";
}

sub vcl_init {
    # Round-robin director, currently with a single backend behind the Swarm VIP.
    new origin_director = directors.round_robin();
    origin_director.add_backend(origin);
}