
I have an architecture with two Varnish servers sitting in front of five webheads. Each Varnish server is configured with a round-robin backend director, but at times of moderate to high load Varnish seems to heavily favour the first backend defined in the list.

Varnish version is 3.0.5.

If the first backend is marked as sick, the second backend in the list is heavily favoured, and so on.
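For reference, a backend's health can also be forced from the varnishadm CLI when testing this failover behaviour. In Varnish 3 this should be the backend.set_health command (the exact matcher syntax may need adjusting):

varnishadm backend.set_health web1 sick   # take web1 out of rotation
varnishadm backend.set_health web1 auto   # hand control back to the probe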

varnish> backend.list
200
Backend name                   Refs   Admin      Probe
web1(************,,8080)       102    probe      Healthy 8/8
web2(************,,8080)       17     probe      Healthy 8/8
web3(************,,8080)       9      probe      Healthy 8/8
web4(************,,8080)       17     probe      Healthy 8/8
web5(************,,8080)       12     probe      Healthy 8/8

Some parts of the VCL that might be pertinent:

probe healthcheck {
   .request =
         "GET /LICENSE.txt HTTP/1.1"
         "Host: **********.co.uk"
         "Connection: close";
   .interval = 120s;
   .timeout = 90s; # High values due to expected slow responses
   .window = 8;
   .threshold = 3;
   .initial = 3;
   #.expected_response = 200; # Still want the Magento maintenance page to display so no response code check
}
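The 90 second probe timeout is questioned in the comments below. For comparison, a tighter probe against the same file might look something like this; the values are purely illustrative and untested here:

probe healthcheck {
   .request =
         "GET /LICENSE.txt HTTP/1.1"
         "Host: **********.co.uk"
         "Connection: close";
   .interval = 30s;
   .timeout = 5s; # Fail fast so a struggling backend drops out of rotation sooner
   .window = 8;
   .threshold = 3;
   .initial = 3;
}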

backend web1 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s; # High values due to expected slow responses
    .first_byte_timeout = 240s; # High values due to expected slow responses
    .between_bytes_timeout = 240s; # High values due to expected slow responses
    .probe = healthcheck;
}
backend web2 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s; # High values due to expected slow responses
    .first_byte_timeout = 240s; # High values due to expected slow responses
    .between_bytes_timeout = 240s; # High values due to expected slow responses
    .probe = healthcheck;
}
backend web3 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s; # High values due to expected slow responses
    .first_byte_timeout = 240s; # High values due to expected slow responses
    .between_bytes_timeout = 240s; # High values due to expected slow responses
    .probe = healthcheck;
}
backend web4 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s; # High values due to expected slow responses
    .first_byte_timeout = 240s; # High values due to expected slow responses
    .between_bytes_timeout = 240s; # High values due to expected slow responses
    .probe = healthcheck;
}
backend web5 {
    .host = "************";
    .port = "8080";
    .connect_timeout = 240s; # High values due to expected slow responses
    .first_byte_timeout = 240s; # High values due to expected slow responses
    .between_bytes_timeout = 240s; # High values due to expected slow responses
    .probe = healthcheck;
}

director backend_director round-robin {
    { .backend = web1; }
    { .backend = web2; }
    { .backend = web3; }
    { .backend = web4; }
    { .backend = web5; }
}

sub vcl_recv {
    set req.backend = backend_director;

    # loads more stuff
}
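To see which backend actually serves each fetch, a debugging addition along these lines can help (not part of our production VCL; in Varnish 3, beresp.backend.name is readable in vcl_fetch):

sub vcl_fetch {
    # Expose the backend that handled this fetch so the skew can be watched
    # per-response rather than only via the Refs column in backend.list.
    set beresp.http.X-Backend = beresp.backend.name;
}

Note that cache hits will carry the header of whichever backend originally populated the object, so only misses and passes reflect the current director decision.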

Can anyone shed light on why the round-robin director would so heavily favour the first defined backend, or what might cause the director to be bypassed entirely? I have already ensured that return(pipe) is not used in vcl_recv.
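As one of the comments below points out, a conditional backend assignment elsewhere in vcl_recv can cause exactly this: any request that does not match the condition falls through to the default backend, which is the first backend defined. A hypothetical illustration (the hostname is invented):

sub vcl_recv {
    # Requests for any other host never reach the director and are served
    # by the default backend, i.e. the first one defined (web1 here).
    if (req.http.host == "www.example.co.uk") {
        set req.backend = backend_director;
    }
}

In our case the assignment is unconditional, so something else in the VCL not posted here would have to be overriding it.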

shanethehat
  • Do you mind sharing your `healthcheck`? How do response times from your backends compare? Are 'sticky sessions' in the picture? – KM. Apr 02 '14 at 16:32
  • @KM. Added the healthcheck. I don't have data available for response times, but the imbalance in the backend list is reflected in the server load, which is significantly higher on whichever server varnish is favouring. – shanethehat Apr 02 '14 at 21:26
  • As a test, would it be possible to try using a random director with equal weight? "Equal weight means equal traffic" -- from: https://www.varnish-cache.org/docs/3.0/reference/vcl.html#the-family-of-random-directors . How are you handling session cookies (if any)? – KM. Apr 03 '14 at 13:26
  • We were originally using a random director, and switched to round-robin in an attempt to fix the issue. It made no difference. Once a user is granted a session we start sending a nocache cookie that causes a `return(pass)`. On the backend sessions are handled by a central memcached instance so sticky sessions are not required. – shanethehat Apr 03 '14 at 13:43
  • A theory has been floated that the excessively long timeout in the healthcheck might be the cause of the issue. – shanethehat Apr 04 '14 at 14:29
  • A 90s timeout for a flat text file seems too long. Usually you want your healthcheck to hit a URI that loads enough components (web server, db, caching, etc) to ensure the backend is up. How has varnish reacted to lower timeouts? – KM. Apr 08 '14 at 14:32
  • We haven't tried that yet. It's next on the list, because we're starting to suspect that the issue lies with the webhead in question and the overly long healthcheck is stopping varnish realising something is wrong. – shanethehat Apr 08 '14 at 15:35
  • I'm having more or less the same issue. Varnish 3.0.5, two backends. The first one gets hit about 85% of the time. I've tried round-robin and equal-weighted random. No dice. Did you find a resolution? – astrostl Aug 20 '14 at 20:44
  • Update: 'fgs' on irc.linpro.no's #varnish IRC channel checked out our VCL and made a great diagnosis - we had a set req.backend that was conditional upon a given req.http.host, but a recent domain change meant that we had a lot more traffic going to other req.http.hosts. Since that wasn't captured by the "if" statement, a lot of requests were going to a default backend: web01, the first one that happened to be defined. Cold comfort for you, though, as your req.backend setting isn't conditional. Maybe something else in the VCL you didn't paste is changing it, though? – astrostl Aug 21 '14 at 18:27
  • Just as something to try, what happens if you change the order of servers in the director configuration? – Kirrus Nov 07 '14 at 22:15
  • Did you ever solve this? Curious how if so! – Kirrus Apr 22 '16 at 18:35
  • Unfortunately not, we ended up scaling other parts of the system to relieve the pressure on the backends. Never figured out why Varnish seemed to favour certain ones. – shanethehat Apr 25 '16 at 08:55

0 Answers