I have an architecture with two Varnish servers sitting in front of five webheads. Each Varnish server is configured with a round-robin backend director, but at times of moderate to high load Varnish seems to heavily favour the first defined backend in the list.
Varnish version is 3.0.5.
If the first backend is marked as sick, the second backend in the list is heavily favoured instead, and so on.
varnish> backend.list
200
Backend name                   Refs   Admin   Probe
web1(************,,8080)       102    probe   Healthy 8/8
web2(************,,8080)       17     probe   Healthy 8/8
web3(************,,8080)       9      probe   Healthy 8/8
web4(************,,8080)       17     probe   Healthy 8/8
web5(************,,8080)       12     probe   Healthy 8/8
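For what it's worth, per-request backend selection can (as far as I know, on 3.0.x) be watched live with:

varnishlog -c -i Backend

which should print the director and backend chosen for each client request, and a backend can be forced sick for testing with backend.set_health web1 sick from varnishadm.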
Some parts of the VCL that might be pertinent:
probe healthcheck {
    .request =
        "GET /LICENSE.txt HTTP/1.1"
        "Host: **********.co.uk"
        "Connection: close";
    .interval = 120s;
    .timeout = 90s;     # High values due to expected slow responses
    .window = 8;
    .threshold = 3;
    .initial = 3;
    #.expected_response = 200; # Still want the Magento maintenance page to display so no response code check
}
backend web1 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s;        # High values due to expected slow responses
    .first_byte_timeout = 240s;     # High values due to expected slow responses
    .between_bytes_timeout = 240s;  # High values due to expected slow responses
    .probe = healthcheck;
}
backend web2 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s;        # High values due to expected slow responses
    .first_byte_timeout = 240s;     # High values due to expected slow responses
    .between_bytes_timeout = 240s;  # High values due to expected slow responses
    .probe = healthcheck;
}
backend web3 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s;        # High values due to expected slow responses
    .first_byte_timeout = 240s;     # High values due to expected slow responses
    .between_bytes_timeout = 240s;  # High values due to expected slow responses
    .probe = healthcheck;
}
backend web4 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s;        # High values due to expected slow responses
    .first_byte_timeout = 240s;     # High values due to expected slow responses
    .between_bytes_timeout = 240s;  # High values due to expected slow responses
    .probe = healthcheck;
}
backend web5 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s;        # High values due to expected slow responses
    .first_byte_timeout = 240s;     # High values due to expected slow responses
    .between_bytes_timeout = 240s;  # High values due to expected slow responses
    .probe = healthcheck;
}
director backend_director round-robin {
    { .backend = web1; }
    { .backend = web2; }
    { .backend = web3; }
    { .backend = web4; }
    { .backend = web5; }
}
sub vcl_recv {
    set req.backend = backend_director;
    # loads more stuff
}
Can anyone shed light on why the round-robin director would so heavily favour the first defined backend, or on what might cause the director to be bypassed entirely? I have already ensured that return(pipe) is not used in vcl_recv (an example of what I mean by a bypass is sketched below).
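To be explicit about what I mean by bypassing the director, I am thinking of (hypothetical) patterns like the following sketch; the URL and header it matches on are made up, not taken from my config:

sub vcl_recv {
    if (req.url ~ "^/admin") {
        # Pinning a request straight to one backend skips the director.
        set req.backend = web1;
    }
    if (req.http.Upgrade ~ "(?i)websocket") {
        # A piped connection sticks to whichever backend was chosen for its
        # first request, so later requests on that client connection never
        # go back through the director.
        return (pipe);
    }
}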