10

We run nginx on Ubuntu Trusty. It serves several websites over HTTPS, all from a single IP address.

Seemingly at random, though loosely correlated with load, individual requests end up on the wrong vhost: requests for lustrum.thalia.nu are served by thalia.nu and vice versa. Users suddenly land on a different website and get nasty error pages. Pressing F5 brings them back to the site they originally requested.

It does not seem to be related to the browser or operating system. It has been confirmed to happen on Firefox (Linux, Windows, Mac), Edge (Windows), Chrome (Linux, Windows, Android) and Safari (iOS).

The issue appears to occur more frequently when the system is put under load, suggesting some sort of race condition.

lustrum.thalia.nu

server {
        server_name lustrum.thalia.nu;

        listen 443 ssl;

        ssl on;
        ssl_certificate /etc/nginx/certs/lustrum.thalia.nu.crt;
        ssl_certificate_key /etc/nginx/certs/lustrum.thalia.nu.key;

        add_header Strict-Transport-Security "max-age=63072000; preload";

        root /var/www/thalia-lustrum/public_html;

        location / {
                index index.php;
                try_files $uri $uri/ /index.php?$args;
        }

        # Add trailing slash to */wp-admin requests.
        rewrite /wp-admin$ $scheme://$host$uri/ permanent;

        # Pass all .php files onto a php-fpm/php-fcgi server.
        location ~ [^/]\.php(/|$) {
                include         /etc/nginx/fastcgi_params;

                fastcgi_split_path_info ^(.+?\.php)(/.*)$;

                if (!-f $document_root$fastcgi_script_name) {
                        return 404;
                }

                fastcgi_pass    unix:/var/run/php5-fpm-thalia-lustrum.sock;
                fastcgi_index   index.php;
                fastcgi_param   SCRIPT_FILENAME  /public_html$fastcgi_script_name;
        }
}

thalia.nu

server {
        server_name thalia.nu;    
        listen 443 ssl;

        ssl on;
        ssl_certificate /etc/nginx/certs/www.thalia.nu.crt;
        ssl_certificate_key /etc/nginx/certs/www.thalia.nu.key;

        add_header Strict-Transport-Security "max-age=63072000; preload";

        root /var/www/thalia/public_html;

        location / {
                try_files $uri $uri/ /index.php/$request_uri;
                index index.php index.html index.htm;
        }

        location ~ \.php($|/) {
                include         /etc/nginx/fastcgi_params;
                set  $script     $uri;
                set  $path_info  "";
                if ($uri ~ "^(.+\.php)(/.+)") {
                                set  $script     $1;
                                set  $path_info  $2;
                }
                fastcgi_read_timeout    120;
                fastcgi_pass    unix:/var/run/php5-fpm-thalia-www.sock;
                fastcgi_index   index.php;
                fastcgi_param   SCRIPT_FILENAME  /public_html$fastcgi_script_name;
        }
}

As you can see, we're running different PHP5-FPM pools for these two domains. These pools are chrooted to different folders and run as different users. The PHP-FPM configuration is otherwise fairly standard as far as I can tell.
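
To see from the client side which server block answered a given request, each block could tag its responses with a debug header. This is only a sketch and not part of the configuration above: the header name `X-Debug-Vhost` is made up for illustration, while `$server_name` is a standard nginx variable.

```nginx
# Inside each server {} block, next to the existing add_header line.
# "X-Debug-Vhost" is a hypothetical header name, used only for debugging.
# Without the "always" flag (available from nginx 1.7.5), nginx only adds
# the header for certain status codes (200, 301, 302, 304, among others).
add_header X-Debug-Vhost $server_name;
```

If a misrouted response still carries the expected vhost name in this header, nginx selected the right server block and the mix-up happens further down the chain.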

We've tried both nginx 1.4.6-ubuntu3 and nginx 1.8.0-1+trusty.

Log telemetry

266.266.266.266 - - [25/Nov/2015:09:24:40 +0100] "GET /committees/175 HTTP/1.1" 302 5 "https://thalia.nu/committees" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0" Host: "thalia.nu" Location: "https://thalia.nu/index.php//committees/wp-admin/setup-config.php"

In this line you can see that the request for the page /committees suddenly gets redirected to wp-admin. It appears that this request was handled by the thalia-lustrum PHP-FPM pool...
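
For reference, a log line like the one above can be produced with a custom `log_format`. The following is a minimal sketch rather than our exact configuration: the format name `vhost_debug` and the log path are assumptions, and the trailing `vhost:` field is an extra that the line above does not contain. The variables themselves (`$http_host`, `$sent_http_location`, `$server_name`) are standard nginx variables.

```nginx
# Goes in the http {} context. The name "vhost_debug" is hypothetical.
# The first three lines mirror nginx's predefined "combined" format.
log_format vhost_debug '$remote_addr - $remote_user [$time_local] '
                       '"$request" $status $body_bytes_sent '
                       '"$http_referer" "$http_user_agent" '
                       'Host: "$http_host" Location: "$sent_http_location" '
                       'vhost: "$server_name"';

# Goes in each server {} block. The log path is an assumption.
access_log /var/log/nginx/vhost_debug.log vhost_debug;
```

If `Host` and `vhost` agree on the misrouted requests, nginx picked the correct server block and the wrong code is being run behind the FastCGI socket.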

DNS zone file

We don't see how this can possibly be relevant, but...

;; MX Records
thalia.nu.    300    IN    MX    20    relay.transip.nl.
thalia.nu.    300    IN    MX    10    ivo.thalia.nu.

;; TXT Records
thalia.nu.    300    IN    TXT    "v=spf1 a mx a:mulgore.hexon-is.nl a:moonray.hexon-is.nl a:fred.thalia.nu a:ivo.thalia.nu ~all"

;; SPF Records (Sender Policy Framework)
thalia.nu.    300    IN    SPF    "v=spf1 a mx a:mulgore.hexon-is.nl a:moonray.hexon-is.nl a:fred.thalia.nu a:ivo.thalia.nu ~all"

;; CNAME Records
lustrum.thalia.nu.    300    IN    CNAME    thalia.nu.

;; A Records (IPv4 addresses)
thalia.nu.    300    IN    A    131.174.31.8
www.thalia.nu.    300    IN    A    131.174.31.8
ivo.thalia.nu.    300    IN    A    131.174.31.8
Thom Wiggers
  • Please check your DNS settings for the domains. – Diamond Nov 18 '15 at 20:39
  • @bangal they are an A and a CNAME record, pointing to the same IP. I do not see how this is related, though; these resolve just fine, and it seems unlikely that a DNS issue would manifest so inconsistently. – Joost Nov 18 '15 at 22:23
  • Do you have more than one nginx behind a load balancer? If so, having different configurations may give the symptoms you're having. Also, nginx chooses which site to serve based on the `Host:` HTTP header (name-based virtual hosting). – Fredi Nov 24 '15 at 11:03
  • Negative, only one nginx is running. There are several php-fpm processes though. – Thom Wiggers Nov 24 '15 at 11:04
  • Your DNS server does not support ANY queries. That makes debugging DNS related problems harder. Could you include the full zone in your question (assuming it isn't too large)? – kasperd Nov 24 '15 at 11:55
  • @ThomWiggers, can you add the content of the `Host:` HTTP header and the user agent to your log file? See here for how: http://serverfault.com/questions/636790/nginx-log-complete-request-response-with-all-headers. Actually, I tried making some requests to your websites but couldn't reproduce your problem. What client are you using to reproduce this? – Fredi Nov 24 '15 at 13:15
  • @Fredi: the clients used are covered in the question. It's happening on only a small number of requests. – Thom Wiggers Nov 24 '15 at 16:44
  • @kasperd I don't see how this can possibly be relevant, but `lustrum.thalia.nu. 300 IN CNAME thalia.nu. thalia.nu. 300 IN A 131.174.31.8` – Thom Wiggers Nov 24 '15 at 16:45
  • @ThomWiggers RFC 6555 comes to my mind as one possible explanation for a server seemingly randomly handing out different answers some of the time. That is easy to forget to check for but I surely would remember that possibility if I was presented with the full list of records of all types for the domain. There might be other relevant record types that are easy to forget about, but I can't remember any at the moment. – kasperd Nov 24 '15 at 17:21
  • We don't have IPv6 connectivity in this rack, so that RFC does not seem applicable. – Thom Wiggers Nov 24 '15 at 17:36
  • @ThomWiggers What other record types do you have that do make a difference? – kasperd Nov 24 '15 at 19:18
  • I still don't see the `Host` header logged. If Nginx makes a mistake in vhost selection, you'd think the host header is also wrong. – Halfgaar Nov 25 '15 at 08:16
  • Is the fact that I just got "Third party content not installed" or something because you're working on it, or did I end up at another PHP pool or something (triggering the same bug)? I also got a brief error about `config.php` not found. – Halfgaar Nov 25 '15 at 08:26
  • Yup, you've encountered this issue. I'll extract the logging info later. – Thom Wiggers Nov 25 '15 at 08:39
  • "If is evil", perhaps? https://www.nginx.com/resources/wiki/start/topics/depth/ifisevil/ The only safe directives inside an `if` in a location block are `return` and `rewrite`. – Drifter104 Nov 25 '15 at 18:02
  • @Drifter104 While that's a good hint, I've refactored out all of our `if` blocks in all `server` blocks, and I can still reproduce it. – Thom Wiggers Nov 25 '15 at 18:38
  • @kasperd for completeness, the relevant zone entries are now included. – Joost Nov 25 '15 at 18:39
  • @ThomWiggers I also saw the `Third party libraries not installed. Make sure that composer has required libraries in the concrete/ directory.` message multiple times on `lustrum.thalia.nu` (using code 200 even though there was an error). And `thalia.nu` sometimes redirects to `https://thalia.nu/wp-admin/setup-config.php`. But I don't see any indication that requests are directed to the wrong vhost. – kasperd Nov 25 '15 at 19:10
  • @kasperd The redirect to `wp-admin` is code from the `lustrum.thalia.nu` chroot, while the Composer error is code from the `thalia.nu` chroot... Something is going wrong there. – Thom Wiggers Nov 25 '15 at 19:14
  • @ThomWiggers Does the problem only affect php scripts or does it also affect static content? – kasperd Nov 25 '15 at 19:17
  • I don't see any answer to the question by @Fredi about load balancing. – kasperd Nov 25 '15 at 19:19
  • @kasperd http://serverfault.com/questions/737349/webserver-randomly-serves-different-vhosts#comment922812_737349. It does appear to only affect PHP scripts. – Thom Wiggers Nov 25 '15 at 20:17

4 Answers

4

After hours of debugging we've finally traced this issue to its cause. It appears the culprit isn't nginx but PHP-FPM. We're running php5-fpm version 5.5.9-1ubuntu4.14. When new workers are forked, something occasionally goes wrong and the workers run (part of?) the code of different workers.

Our solution was to copy /etc/php5/fpm/php5-fpm.conf into separate copies, each with its own pool.d folder, and then to copy /etc/init.d/php5-fpm so that each copy launches with its own config file (also creating the corresponding files in /etc/init/). This means we now have one php5-fpm process manager per pool. Separate chroots and sockets alone don't appear to keep things separate enough.

Thom Wiggers
  • Note that it is currently unclear if this is an issue in our configuration or in (this version of) php5-fpm, although the latter does not seem likely given the lack of similar reports. If we end up finding the reason why this problem occurs, this answer will be updated. – Joost Nov 26 '15 at 10:42
2

I am facing the same issue, but on Debian with Apache 2.4.25 and PHP 7.1-FPM. Here is a way to separate the processes: https://ma.ttias.be/a-better-way-to-run-php-fpm/

For those like me who might find this solution too heavy for small websites, add php_admin_value[opcache.revalidate_freq] = 0 at the end of the php-fpm pool configuration file. However, that may have a serious impact on performance...

Here is the official bug report: https://bugs.php.net/bug.php?id=67141

Nic0tiN
0

Does nginx support SNI? Run nginx -V and you should see something like TLS SNI support enabled. If you don't, that may be the cause, because without SNI the hostname is only sent after the handshake, and I'm assuming you have a wildcard certificate for *.thalia.nu.

Mugurel
  • Of course it does, without SNI this would go wrong 100% of the time instead of very occasionally. (and I've also checked this, it is definitely enabled) – Thom Wiggers Nov 24 '15 at 11:02
  • FWIW, note that we do not serve a wildcard certificate, but use individual certificates for the separate subdomains. This is included in the configuration listed in the question. – Joost Nov 24 '15 at 11:12
  • ...although the lustrum.thalia.nu certificate is also valid for thalia.nu – Thom Wiggers Nov 24 '15 at 11:17
  • Can you try adding the includeSubDomains parameter like this? add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"; – Mugurel Nov 24 '15 at 11:18
  • @ThomWiggers If the certificate is valid for multiple domains it is possible to support multiple domains on a single IP without the need for SNI. – kasperd Nov 24 '15 at 11:42
  • Have you reproduced this behaviour with a client which you KNOW supports SNI / is there a correlation between the user-agent and occurrences? – symcbean Nov 24 '15 at 14:27
  • @symcbean This is covered in the question: it occurs with every major client on every major OS. – Thom Wiggers Nov 24 '15 at 16:43
  • @Mugurel: no, we can't add that header and I also don't see how HSTS would be related. – Thom Wiggers Nov 24 '15 at 17:14
-1

It seems that the certificate is not right: Firefox is telling me that it is issued for www.thalia.nu, not thalia.nu.

This is IMHO what is causing trouble. Try with another certificate or try activating HTTP connections without SSL.

Xavier Nicollet
  • We cannot reproduce that. The certificates served at `www.thalia.nu` and `thalia.nu` include both domains, with and without `www`. What Firefox version are you using, and on what platform? – Joost Nov 25 '15 at 19:08