
One of my customers came to me with a Varnish performance problem.

Long story short, after a lot of debugging:
when Varnish serves an object from its cache (in memory), the response is really sluggish (> 5 seconds);
when Varnish has to fetch the object from the Apache backend, there is no speed problem (< 1 second).

Example of a slow request (from varnishlog):

193 ReqStart     c <client ip> 59490 1329239608  
193 RxRequest    c GET  
193 RxURL        c /<my_url>/toto.png  
193 RxProtocol   c HTTP/1.1  
193 RxHeader     c Accept: */*  
193 RxHeader     c Referer: <my_referer>  
193 RxHeader     c Accept-Language: fr  
193 RxHeader     c User-Agent: <client_useragent>  
193 RxHeader     c Accept-Encoding: gzip, deflate  
193 RxHeader     c Host: <my_vhost>  
193 RxHeader     c Connection: Keep-Alive  
193 VCL_call     c recv lookup  
193 VCL_call     c hash  
193 Hash         c /<my_url>/toto.png  
193 Hash         c <my_vhost>  
193 VCL_return   c hash  
193 Hit          c 1329136358  
193 VCL_call     c hit deliver  
193 VCL_call     c deliver deliver  
193 TxProtocol   c HTTP/1.1  
193 TxStatus     c 200  
193 TxResponse   c OK  
193 TxHeader     c Server: Apache  
193 TxHeader     c Last-Modified: Mon, 18 Jun 2012 08:57:46 GMT  
193 TxHeader     c ETag: "c330-4c2bb5c0ef680"  
193 TxHeader     c Cache-Control: max-age=1200  
193 TxHeader     c Content-Type: image/png  
193 TxHeader     c Content-Length: 49968  
193 TxHeader     c Accept-Ranges: bytes  
193 TxHeader     c Date: Tue, 16 Oct 2012 06:54:03 GMT  
193 TxHeader     c X-Varnish: 1329239608 1329136358  
193 TxHeader     c Age: 391  
193 TxHeader     c Via: 1.1 varnish  
193 TxHeader     c Connection: keep-alive  
193 TxHeader     c X-Cache: HIT  
193 TxHeader     c X-Cache-Hits: 210  
193 Length       c 49968  
193 ReqEnd       c 1329239608 1350370443.778280735 1350370480.921206713 0.372072458 0.000045538 37.142880440   

If I read it correctly, the problem is on the last line (ReqEnd):
37.142880440 is the time, in seconds, spent delivering the file.
I see the same problem locally, so it is not a bandwidth problem.
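
A quick way to pull the slow deliveries out of the log (a sketch, assuming the Varnish 3 ReqEnd layout shown above, where the last field is the delivery time in seconds):

# print only the ReqEnd records whose delivery time exceeds 1 second
varnishlog -c -i ReqEnd | awk '$NF > 1.0'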

The problem only happens in the morning, when traffic peaks (~400 req/s).
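
To confirm the request rate during the peak, the client_req counter can be sampled twice (a rough sketch using varnishstat's one-shot mode):

# estimate req/s by sampling the client_req counter 10 seconds apart
r1=$(varnishstat -1 -f client_req | awk '{print $2}')
sleep 10
r2=$(varnishstat -1 -f client_req | awk '{print $2}')
echo "$(( (r2 - r1) / 10 )) req/s"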

Varnish startup options:

DAEMON_OPTS="-a :80 \  
             -T localhost:6082 \  
             -f /etc/varnish/default.vcl \  
             -S /etc/varnish/secret \  
             -p thread_pool_min=100 \  
             -p thread_pool_max=1000 \  
             -p session_linger=100 \  
             -s malloc,8G"  

Varnish seems to have enough RAM and looks fine:

SMA.s0.c_req             4303728        38.35 Allocator requests
SMA.s0.c_fail                  0         0.00 Allocator failures
SMA.s0.c_bytes      169709790476   1512443.66 Bytes allocated
SMA.s0.c_freed      168334747402   1500189.36 Bytes freed
SMA.s0.g_alloc            172011          .   Allocations outstanding
SMA.s0.g_bytes        1375043074          .   Bytes outstanding
SMA.s0.g_space        7214891518          .   Bytes available

n_wrk                      200          .   N worker threads
n_wrk_create               200         0.00 N worker threads created
n_wrk_failed                 0         0.00 N worker threads not created
n_wrk_max                    0         0.00 N worker threads limited
n_wrk_lqueue                 0         0.00 work request queue length
n_wrk_queued                26         0.00 N queued work requests
n_wrk_drop                   0         0.00 N dropped work requests

n_lru_nuked                  0          .   N LRU nuked objects
n_lru_moved            8495031          .   N LRU moved objects
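
To see whether requests start queueing during the morning peak, the thread counters can be watched over time (sketch):

# watch the worker-thread counters; a steadily growing n_wrk_queued
# means the pools are running out of threads
while true; do
    varnishstat -1 -f n_wrk,n_wrk_queued,n_wrk_lqueue,n_wrk_drop
    sleep 5
done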

Varnish is up to date (3.0.3-1~lenny).

If you have an idea or a lead to follow, I'd appreciate it.

The Varnish configuration:

backend default {
    .host = "127.0.0.1";
    .port = "8000";
    .connect_timeout = 10s;
    .first_byte_timeout = 10s;
    .between_bytes_timeout = 5s;
}

sub vcl_recv {
  set req.grace = 1h;

  if (req.http.Accept-Encoding) {
    if (req.http.Accept-Encoding ~ "gzip") {
      set req.http.Accept-Encoding = "gzip";
    }
    else if (req.http.Accept-Encoding ~ "deflate") {
      set req.http.Accept-Encoding = "deflate";
    }
    else {
      unset req.http.Accept-Encoding;
    }
  }

  if (req.url ~ "(?i)\.(png|gif|jpeg|jpg|ico|swf|css|js|eot|ttf|woff|svg|htm|xml)(\?[a-z0-9]+)?$") {
    unset req.http.Cookie;
  }

  if (req.url ~ "^/content/.+\.xml$") {
    unset req.http.Cookie;
  }

  if (req.url ~ "^/min/") {
    unset req.http.Cookie;
  }

  if (req.restarts == 0) {
    if (req.http.x-forwarded-for) {
        set req.http.X-Forwarded-For =
        req.http.X-Forwarded-For + ", " + client.ip;
    } else {
        set req.http.X-Forwarded-For = client.ip;
    }
  }
  if (req.request != "GET" &&
    req.request != "HEAD" &&
    req.request != "PUT" &&
    req.request != "POST" &&
    req.request != "TRACE" &&
    req.request != "OPTIONS" &&
    req.request != "DELETE") {
      return (pipe);
  }
  if (req.request != "GET" && req.request != "HEAD") {
      return (pass);
  }
  if (req.http.Authorization || req.http.Cookie) {
      return (pass);
  }
  return (lookup);
}

sub vcl_fetch {
    if (req.url ~ "(?i)\.(png|gif|jpeg|jpg|ico|swf|css|js|eot|ttf|woff|svg|htm|xml)(\?[a-z0-9]+)?$") {
        unset beresp.http.set-cookie;
    }

    if (req.url ~ "^/(content|common)/.+\.xml$") {
      unset req.http.Cookie;
    }

    if (req.url ~ "^/min/") {
      unset req.http.Cookie;
    }

    set beresp.grace = 1h;

    if (beresp.ttl <= 0s ||
        beresp.http.Set-Cookie ||
        beresp.http.Vary == "*") {
        set beresp.ttl = 120s;
        return (hit_for_pass);
    }
    return (deliver);
}

sub vcl_deliver {
    if (obj.hits > 0) {
        set resp.http.X-Cache = "HIT";
        set resp.http.X-Cache-Hits = obj.hits;
    } else {
        set resp.http.X-Cache = "MISS";
    }

    return (deliver);
}
Julien C.
  • I hit a similar issue in the past, where 'hot' cached URLs would sometimes become very slow to retrieve; it only manifested in production under high req/s. The issue was difficult to catch and I never managed to produce a verifiable bug report that the developers could act on. – ttt Nov 19 '12 at 15:49
  • I think we worked around this issue by splitting the load across a second Varnish server. In my experience Varnish has more subtle, hard-to-catch issues like this that only manifest in specific circumstances; I guess it's just not mature enough yet. If you can describe steps to reproduce or demonstrate the problem to Varnish support, please go for it :) – ttt Nov 19 '12 at 15:58
  • Did you figure this one out? – ali haider Jan 28 '13 at 01:55
  • What is the total RAM of your system? I think the issue might be in `malloc`. Edit: lol, didn't notice how old this question is. – Mohammad AbuShady Feb 02 '14 at 03:38
  • Any updates? I am facing the same issue. – guyyug Nov 27 '16 at 10:03

1 Answer


You had

-p thread_pool_max=1000

which is actually the recommended minimum, and you also had

n_wrk_queued 26

which is an indicator that it is time to increase the thread count. I believe that if you had raised thread_pool_max to 2000, for example, and kept an eye on n_wrk_queued to make sure you don't need more, things would have worked out fine.
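
If restarting varnishd is inconvenient, the parameter can also be raised on the running instance through the management interface and then persisted in DAEMON_OPTS (a sketch, reusing the -T/-S values from the question):

# raise the per-pool thread ceiling without a restart
varnishadm -T localhost:6082 -S /etc/varnish/secret param.set thread_pool_max 2000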

Mohammad AbuShady