Varnish sometimes (up to ~5% of the requests) triggers err_too_many_redirects, running in docker combined with plesk for wordpress

Question

We've followed these instructions https://support.plesk.com/hc/en-us/articles/115001888894-Does-Plesk-support-Varnish- and were able to get it to work 'properly', it serves the wordpress site through varnish and the speed is amazing. All good and well, but our uptime robot reports short downtime every now and then (very irregular, but between 1 and 8 times a day for a period of <5 minutes, on various times throughout the day).

After some various testing we've noticed an issue with the wp-admin url, we've concluded that this occurs because of our DirectoryIndex (which is index.html by default), we've resolved this by adding 'DirectoryIndex index.php' to our 'Additional Apache directives' which resolved that issue.

One thing that seemed interesting to us is that when we change how PHP(FPM) is served, it affects the same error. When we serve through NGINX we'll get err_too_many_redirects aswell, if we serve it through Apache it works 95% of the time.

I've added some (varnishncsa) logging below, when the header response states 301 that's when we're in the loop:

172.17.0.1 - - [11/Jan/2022:13:54:55 +0000] "HEAD http://[the varnish domain]/ HTTP/1.0" 200 0 "https://[the varnish domain]" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.[uptimerobot].com/)"

172.17.0.1 - - [11/Jan/2022:13:56:51 +0000] "GET http://[the varnish domain]/ HTTP/1.0" 200 11184 "-" "-"

172.17.0.1 - - [11/Jan/2022:13:58:31 +0000] "HEAD http://[the varnish domain]/ HTTP/1.0" 301 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36"

172.17.0.1 - - [11/Jan/2022:13:58:32 +0000] "HEAD http://[the varnish domain]/ HTTP/1.0" 301 0 "http://[the varnish domain]" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36"

After this we the following log 8 times: 172.17.0.1 - - [11/Jan/2022:13:58:32 +0000] "HEAD http://[the varnish domain]/ HTTP/1.0" 301 0 "https://[the varnish domain]/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36"

Then one: 172.17.0.1 - - [11/Jan/2022:13:58:34 +0000] "GET http://[the varnish domain]/ HTTP/1.0" 301 238 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/96.0.4664.110 Safari/537.36"

Then 7 times: 172.17.0.1 - - [11/Jan/2022:13:59:55 +0000] "HEAD http://[the varnish domain]/ HTTP/1.0" 301 0 "https://[the varnish domain]" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.[uptimerobot].com/)"

Then 22 times: 172.17.0.1 - - [11/Jan/2022:14:00:17 +0000] "GET http://[the varnish domain]/ HTTP/1.0" 301 238 "https://[the varnish domain]" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.[uptimerobot].com/)"

Then right after that everything seems fine again.

Update: inspecting the logs

@thijs-feryn pointed out something interesting, which made us conclude that we've dismissed an older theory too soon.Since we felt sure that X-Forwarded-Proto was passed on, because it was in our 'additional apache header configuration'.

However after reading his response carefully we've learned that somehow X-Forwarded-Proto doesn't get passed every request (atleast not on the broken ones). He pointed out that [another varnish domain] seemed to pass this record just fine in a similar request.

After some carefull comparing of the configurations only one thing stood out to us, which was that the preffered domain in plesk for [another varnish domain] was set to 'None', and for [the varnish domain] is was set to [the varnish domain].

So we went to www.[the varnish domain] in our browser, and that seemed to trigger the issue. We switched 'Preferred domain' to 'None' here aswell and now www.[the varnish domain] redirects to [the varnish domain] just fine.

hypothesis

It seems that plesk's own redirect ignores the 'Additional directives for HTTP' and thus 'Vary: X-Forwarded-Proto' wasn't added. We'll be monitoring this and we'll post an update soon, when we're fully confident that this was the cause.

Thijs Feryn · Accepted Answer · 2022-01-13T10:50:20.283

The problem could be related to the lack of protocol awareness when it comes to caching. It's quite typical that an HTTP request that should be redirected to an HTTPS request ends up in the cache, which unconditionally redirects users to the HTTPS version, even if they actually requested an HTTPS version.

There are various ways to fix this depending on the type of webserver you're using.

varnishlog output

But before we can jump to conclusions, I want to see some varnishlog output.

I'd like to see output of the following command when the redirect loop takes place:

varnishlog -g request -q "ReqUrl eq '/'"

The assumption is that the issue also occurs on the homepage that we added as a VSL query to the varnishlog command.

I noticed your varnishncsa output, but unfortunately it too limited in terms of output. varnishlog is a lot better for debugging whereas varnishncsa is just.

Testing the hypothesis

If the redirect loop is indeed caused by the lack of protocol awareness, we can trigger the issue as follows:

Run varnishadm ban obj.status "!=" 0 to empty the cache
Call the plain HTTP URL of the website to ensure this version is cached
Call the HTTPS URL of the website to check whether or not you're stuck in the redirect loop

Fixing the issue

If all the tests add up and the hypothesis is proven, the solution is actually quite simple.

If you're using Apache as the webserver, you can add the following content to your .htaccess file:

SetEnvIf X-Forwarded-Proto "https" HTTPS=on
Header append Vary: X-Forwarded-Proto

Otherwise you can also add the following code to your VCL file:

sub vcl_backend_response {
    if(beresp.http.Vary) {
        if(beresp.http.Vary !~ "X-Forwarded-Proto") {
            set beresp.http.Vary = set beresp.http.Vary + ", X-Forwarded-Proto";
        }
    } else {
        set beresp.http.Vary = "X-Forwarded-Proto";
    }
}

Adding X-Forwarded-Proto to the Vary header will ensure that Varnish creates separate objects in the cache for HTTP and for HTTPS.

I also assume that your TLS proxy actually sends an X-Forwarded-Proto header.

Update: inspecting the logs

After some back and forth in the comments, I received some insightful logs via https://pastebin.com/QzPh1r5R that explain what is going on.

In ** << BeReq >> 951078 you can see the X-Forwarded-Proto: http header. This means the request was sent through a plain HTTP request. This type of request should result in a 301 redirect, and it does according to the following log lines:

--  BerespProtocol HTTP/1.1
--  BerespStatus   301
--  BerespReason   Moved Permanently
--  BerespHeader   Date: Thu, 13 Jan 2022 08:55:17 GMT
--  BerespHeader   Server: Apache
--  BerespHeader   Location: https://[the varnish domain]/
--  BerespHeader   Content-Length: 238
--  BerespHeader   Content-Type: text/html; charset=iso-8859-1
--  TTL            RFC 120 10 0 1642064118 1642064118 1642064117 0 0 cacheable

Not only does the redirection take place, it also gets cached for 120 seconds. It's also interesting to see that there is no Vary: X-Forwarded-Proto header.

At a later time, the * << Request >> 951080 transaction pops up in the logs. We see the following request headers:

-   ReqHeader      X-Forwarded-Proto: https
-   ReqHeader      Host: [another varnish domain]

So this is an HTTPS request for another Varnish domain and for some reason it results in a cache hit that returns a Vary header:

-   RespHeader     Vary: Accept-Encoding,Cookie,X-Forwarded-Proto
-   VCL_call       HIT
-   Hit            886585 90003.966201 10.000000 0.000000

Not only do you see the hit, you can also conclude that the object was inserted in the cache by transaction 886585.

Although it's good to have a Vary header, this one contains the Cookie value which is dangerous. This means that a separate object will be stored in the cache for every possible cookie value that is presented to Varnish. This is quite dangerous, so please remove Cookie from the Vary header and stick with Vary: Accept-Encoding, X-Forwarded-Proto.

The last transaction I looked at is * << Request >> 951082. It has a X-Forwarded-Proto: https request header and should not result in a 301 redirection, but unfortunately it does.

The Hit tag exposes some interesting information:

-   Hit            951078 82.391648 10.000000 0.000000

The object that was hit, was originally inserted in the cache by transaction 951078. If you paid close attention, that was the first one we covered. This transaction was an HTTP-only transaction that resulted in the 301 redirect.

And that's exactly the object that is returned. So even though you're requesting HTTPS content, you're still redirected and that's how you end up in the endless loop.

If you look at the response that is returned by transaction 951082, you're not seeing the Vary header:

-   RespProtocol   HTTP/1.1
-   RespStatus     301
-   RespReason     Moved Permanently
-   RespHeader     Date: Thu, 13 Jan 2022 08:55:17 GMT
-   RespHeader     Server: Apache
-   RespHeader     Location: https://[the varnish domain]/
-   RespHeader     Content-Length: 238
-   RespHeader     Content-Type: text/html; charset=iso-8859-1
-   RespHeader     X-Varnish: 951082 951078
-   RespHeader     Age: 37
-   RespHeader     Via: 1.1 varnish (Varnish/7.0)

Conclusion: please ensure that X-Forwarded-Proto is added to your Vary header. It will prevent getting stuck in the redirect loop.

First of all, thank you for your detailled response. I've tried following the steps described in 'Testing the hypothesis' and made a log of that test using varnishlog (i've uploaded the log https://pastebin.com/hxtp8gX9). I'm continuously logging through varnishlog as you described, and i'll post a log of when the issue occured. >SetEnvIf X-Forwarded-Proto "https" HTTPS=on Header append Vary: X-Forwarded-Proto Is in my 'additional apache configuration', which in essence is about the same as adding it to the .htaccess file. — Remco van Grinsven, Jan 12 '22 at 09:38
@RemcovanGrinsven Keep me posted, Remco. Looking forward to some decent logging when things go wrong. — Thijs Feryn, Jan 12 '22 at 12:18
Thank you for your interrest. I was able to capture one moment of downtime in a varnishlog, the short version of the log can be found here: https://pastebin.com/QzPh1r5R I've uploaded the full log https://we.tl/t-wDfhPo57in (If you prefer another site, We'll upload it there) but that's a lot of logging. Looking forward to hearing your opinion. — Remco van Grinsven, Jan 13 '22 at 09:40
@RemcovanGrinsven I think I found it. I updated my response and added my conclusion. Have a look. — Thijs Feryn, Jan 13 '22 at 10:50
thank you very much for your input. Based on your input we where able to find something suspicious. We're currently testing a possible solution (I've updated our original post for more details). — Remco van Grinsven, Jan 13 '22 at 14:01
The issue hasn't occured over the entire weekend, that's either a huge coincidence or it's safe to conclude that the matter is resolved now. I'll accept your input as the sollution for others who might have something similair. Your input was very insightfull! — Remco van Grinsven, Jan 17 '22 at 15:44

Varnish sometimes (up to ~5% of the requests) triggers err_too_many_redirects, running in docker combined with plesk for wordpress

1 Answers1

varnishlog output

Testing the hypothesis

Fixing the issue

Update: inspecting the logs