
I have a web server container running on App Engine that serves a REST API. I've gone for a relatively standard setup: nginx + PHP-FPM over TCP sockets (I haven't been able to get a unix socket working, for reasons unknown). The DB connection is also a TCP socket, routed over a Google Cloud VPN.
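For reference, the unix-socket variant I couldn't get working would look roughly like this; the socket path is illustrative, not from my actual config:

# nginx (inside the location ~ \.php$ block):
fastcgi_pass unix:/run/php/php-fpm.sock;

; php-fpm (www.conf):
listen = /run/php/php-fpm.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660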

The Problem

I'm getting ~25% availability on the API. Requests often return 504 Gateway Timeout after the maximum wait (App Engine's fronting nginx proxy is set to 60s), and sometimes 502 Bad Gateway when PHP-FPM kills the worker mid-request (request_terminate_timeout) and the connection is reset.

I'm trying to work out whether this is the App Engine nginx being badly configured, my nginx, or my PHP-FPM config. Nginx should be either closing the sockets or reusing them, but it seems to be doing neither.

Client-side

When I siege any given endpoint with 25 concurrent users for a few minutes (example invocation after the transcripts below), I see:

HTTP/1.1 504    60.88 secs:     176 bytes ==> GET  /path/to/rest
                       ...15 lines...
HTTP/1.1 504    61.23 secs:     176 bytes ==> GET  /path/to/rest

HTTP/1.1 200    57.54 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    57.68 secs:   13143 bytes ==> GET  /path/to/rest

HTTP/1.1 504    60.76 secs:     176 bytes ==> GET  /path/to/rest
                       ...15 lines...
HTTP/1.1 504    61.06 secs:     176 bytes ==> GET  /path/to/rest

HTTP/1.1 200    33.35 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    32.97 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    36.61 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    39.00 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    42.47 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    48.51 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200    56.45 secs:   13143 bytes ==> GET  /path/to/rest

# Another run
HTTP/1.1 200     7.65 secs:   13143 bytes ==> GET  /path/to/rest
                       ...10 lines...
HTTP/1.1 200     8.20 secs:   13143 bytes ==> GET  /path/to/rest

HTTP/1.1 502    47.15 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    47.15 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     8.30 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 504    61.15 secs:     176 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.46 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.33 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.25 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    53.63 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    48.40 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.31 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.97 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.27 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.26 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    54.99 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    60.08 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    60.56 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.83 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    60.85 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.99 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    58.99 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    52.40 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    52.21 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.61 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 502    52.65 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.13 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.96 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.48 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.81 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.89 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.26 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     6.80 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.44 secs:     166 bytes ==> GET  /path/to/rest

This also happens with only 1 user:

HTTP/1.1 502    55.43 secs:     166 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.71 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 200     7.54 secs:   13143 bytes ==> GET  /path/to/rest
HTTP/1.1 502    59.21 secs:     166 bytes ==> GET  /path/to/rest
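The siege invocations were roughly the following (URL as in the nginx logs below; the duration is illustrative):

siege -c 25 -t 2M https://api.example.com/path/to/rest    # 25 concurrent users
siege -c 1 -t 2M https://api.example.com/path/to/rest     # single user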

Server-side

Nginx logs for each case:

# 200
Normal logging, e.g. [notice] GET /path/to/rest (param1, param2) ...

# 502
[error] 1059#0: *1395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"

# 504
[error] 34#0: *326 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"

This is what netstat -t looks like:

# Before starting
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED 

# During the siege    
tcp        0      0 localhost:56144         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34565        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53073      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED 
tcp        0      0 localhost:56148         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53071      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34580        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34584        ESTABLISHED 
tcp        0      0 localhost:56106         localhost:9000          ESTABLISHED 
tcp        0      0 localhost:56191         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34566        ESTABLISHED 
tcp        0      0 localhost:56113         localhost:9000          ESTABLISHED 
tcp        0      0 localhost:56150         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34591        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34574        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53102      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:53051      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34572        ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56146         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56117         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56179         ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56160         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56168         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56170         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56111         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56115         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56123         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56109         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56113         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56140         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56181         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56121         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56191         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56119         TIME_WAIT   
tcp        0      0 localhost:9000          localhost:56142         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56106         ESTABLISHED 
tcp        0      0 localhost:9000          localhost:56110         TIME_WAIT   
tcp        8      0 localhost:9000          localhost:56144         ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56148         ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56150         ESTABLISHED 

# A minute or so after ending the siege
tcp        0      0 5c2ad0938ce9:53319      192.168.2.29:postgresql ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34578        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34576        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34570        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34565        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:33971        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34580        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34584        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34566        ESTABLISHED 
tcp        0      0 localhost:56396         localhost:9000          ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34591        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34574        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34072        ESTABLISHED 
tcp        0      0 5c2ad0938ce9:http-alt   172.18.0.3:34572        ESTABLISHED 
tcp        8      0 localhost:9000          localhost:56396         ESTABLISHED 
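Aside: the tcp 8 entries above are the Recv-Q column, which on an ESTABLISHED socket appears to mean 8 bytes sitting unread in PHP-FPM's receive queue, i.e. a worker accepted the connection but isn't consuming it. This can be watched live with ss (not part of the original capture):

ss -tn state established '( sport = :9000 or dport = :9000 )'    # Recv-Q column shows unread bytes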

Configs

nginx.conf

user  www-data;
worker_processes  auto;
worker_cpu_affinity auto;
events {
    worker_connections  512;
}
http {
    server_tokens   off;
    fastcgi_ignore_client_abort off;
    keepalive_timeout  650;
    keepalive_requests 10000;
    gzip on;
    ..more gzip settings..
    server {
        charset utf-8;
        client_max_body_size 512M;
        listen 8080;
        rewrite_log on;
        root        /app/web;
        index       index.php;
        location / {
            try_files $uri /index.php?$args;
        }
        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            include /etc/nginx/fastcgi_params;
            fastcgi_keep_conn off;
            fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        }
    }
    include /etc/nginx/conf.d/*.conf; # There are no extra conf files
}
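Not in the original config, but worth adding to the http block while debugging: an access log that records timings, so client-side numbers can be matched against upstream response times (the format name and path are illustrative):

log_format timing '$remote_addr "$request" $status $request_time $upstream_response_time';
access_log /var/log/nginx/access.log timing;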

php-fpm (www.conf)

[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = ondemand
pm.process_idle_timeout = 10s
request_terminate_timeout = 45
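Note the pool as posted omits pm.max_children, which PHP-FPM requires even with pm = ondemand, so it was presumably trimmed from the post. A fuller sketch (the cap of 10 is purely illustrative, not my real value):

pm = ondemand
pm.max_children = 10            ; illustrative cap, not the original value
pm.process_idle_timeout = 10s
request_terminate_timeout = 45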

More info

Disabling keepalive is a bad idea: App Engine constantly polls the container with health checks, and without keepalive every poll leaves a dead TIME_WAIT socket behind (I tried it).

Prior to setting request_terminate_timeout there were a lot of CLOSE_WAIT sockets instead of TIME_WAIT. Setting request_terminate_timeout = 45 does help in a sense: the worker process gets killed and, after respawning, serves a 200 again. A lower terminate timeout just generates more 502s and fewer 504s.

The process_idle_timeout is ignored, as the sockets are not technically idle.

Setting fastcgi_keep_conn on has no measurable impact on nginx's behaviour.
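To check whether the application is genuinely slow inside PHP, the pool's slowlog dumps a backtrace of any worker exceeding a threshold; I enabled it roughly like this (path illustrative):

slowlog = /var/log/php-fpm/www.slow.log
request_slowlog_timeout = 40s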

  • Most likely it is the application that is unable to keep up with the requests coming to it. You need to check the application's logging and performance. – Tero Kilkanen Apr 25 '17 at 15:50
  • Added examples of logs in each case. –  Apr 26 '17 at 03:27
  • Have you added Nginx logs but labeled them PHP logs? Looks like PHP isn't returning data quickly enough and Nginx times out. – Tim Apr 26 '17 at 04:01
  • Edited to nginx. Yes those are coming from nginx. PHP isn't returning quickly enough, but when you kill the PHP-FPM worker process it suddenly does return quickly enough for 1 request, and then times out again. You can see the normal execution time is ~7 seconds, so why would it take 60 seconds sometimes? In normal operation the `GET /path/to/rest` log is printed immediately, before processing. In the error cases it is never printed. –  Apr 26 '17 at 05:14
  • The problem is in the application. You need to study why the application is so slow. Once you get too many requests for the application, you have exhausted all the workers in the PHP-FPM process pool and you get timeouts. – Tero Kilkanen Apr 26 '17 at 09:05
  • I've enabled the PHP-FPM slowlog but there are no entries showing up (my threshold is 40 seconds). I'm still not convinced that the application is actually that slow. –  Apr 27 '17 at 01:18
  • You still haven't added nginx access logs to your question. Ideally do a few curls, and show the matching access logs, don't just give us a random segment. Make sure your access logs show $request_time. – Tim May 01 '17 at 02:23

1 Answer


It turns out the problem was in the container's network config, not the application. After lowering the MTU to a value appropriate for Google's cloud network (1430, down from 1500), there are no longer any issues querying the application.

This was discovered by isolating the issue to requests that opened a socket over the Google Cloud VPN to the database (see the postgresql entries in the netstat dumps above). We happened to have one VPN routing into a second VPN; the DB connection over that second VPN worked perfectly, as only the first hop carried high-MTU traffic.
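Presumably large DB responses exceeded the path MTU of the tunnel and were silently dropped, which would explain workers blocking on the socket until nginx (504) or request_terminate_timeout (502) gave up. For anyone verifying the same thing, the path MTU can be probed with DF-flagged pings and the interface MTU lowered to match (interface name illustrative; the DB address is the one from the netstat dumps):

# 1472-byte payload + 28 bytes of ICMP/IP headers = 1500 on the wire;
# hangs or errors if the path MTU is lower
ping -M do -s 1472 192.168.2.29

# 1402 + 28 = 1430 on the wire; should succeed over Google's network
ping -M do -s 1402 192.168.2.29

# Lower the container's interface MTU to match
ip link set dev eth0 mtu 1430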