I have a web server container running on App Engine that serves a REST API. I've tried to go for a relatively standard setup: nginx + PHP-FPM communicating over a TCP socket (I haven't gotten a unix socket to work for some reason). The DB connection is also a TCP socket, which runs over a Google Cloud VPN.
The Problem
I'm getting ~25% availability on the API. Often a request will return a 504 Gateway Timeout after the maximum time (App Engine's nginx proxy is set to 60s). Sometimes it will return a 502 Bad Gateway when PHP-FPM times out (request_terminate_timeout).
I'm trying to work out whether this is App Engine's nginx proxy being badly configured, my own nginx config, or my PHP-FPM config. Nginx should be either closing the sockets or reusing them, but it seems to be doing neither.
Client-side
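For reference, the load test was roughly the following (the URL is a placeholder for the real endpoint; -c sets concurrent users, -t the duration). The awk one-liner is just how I summarize availability from the log afterwards:

```shell
# Run the load test and keep the per-request log
siege -c 25 -t 3M https://api.example.com/path/to/rest 2>&1 | tee siege.log

# Availability: share of 200s among all logged responses
awk '/^HTTP\/1\.1/ { total++; if ($2 == 200) ok++ }
     END { printf "%.0f%% available\n", 100 * ok / total }' siege.log
```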
When I siege any given endpoint with 25 concurrent users for a few minutes, I see:
HTTP/1.1 504 60.88 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.23 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.68 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 60.76 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.06 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 33.35 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 32.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 36.61 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 39.00 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 42.47 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 48.51 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 56.45 secs: 13143 bytes ==> GET /path/to/rest
# Another run
HTTP/1.1 200 7.65 secs: 13143 bytes ==> GET /path/to/rest
...10 lines...
HTTP/1.1 200 8.20 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 8.30 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 61.15 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.46 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.33 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.25 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 53.63 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 48.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.31 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.27 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.26 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.08 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.56 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.83 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.85 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 58.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.21 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.61 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.65 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.13 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.96 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.48 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.81 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.89 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.26 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.80 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.44 secs: 166 bytes ==> GET /path/to/rest
This also happens with only 1 user:
HTTP/1.1 502 55.43 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.71 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.21 secs: 166 bytes ==> GET /path/to/rest
Server-side
Nginx logs for each case:
# 200
Normal logging i.e. [notice] GET /path/to/rest (param1, param2) ...
# 502
[error] 1059#0: *1395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"
# 504
[error] 34#0: *326 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"
This is what netstat -t looks like:
# Before starting
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
# During the siege
tcp 0 0 localhost:56144 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53073 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 localhost:56148 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53071 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 localhost:56106 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56191 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56113 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56150 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53102 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:53051 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56146 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56117 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56179 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56160 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56168 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56170 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56111 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56115 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56123 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56109 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56113 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56140 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56181 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56121 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56191 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56119 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56142 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56106 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56110 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56144 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56148 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56150 ESTABLISHED
# A minute or so after ending the siege
tcp 0 0 5c2ad0938ce9:53319 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34578 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34576 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34570 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56396 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56396 ESTABLISHED
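Rather than eyeballing the list, I count the PHP-FPM sockets by state with a one-liner (field numbers assume standard netstat output, where local address, foreign address and state are columns 4, 5 and 6):

```shell
# Count php-fpm (port 9000) sockets by connection state
netstat -tn | awk '$4 ~ /:9000$/ || $5 ~ /:9000$/ { print $6 }' | sort | uniq -c
```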
Configs
nginx.conf
user www-data;
worker_processes auto;
worker_cpu_affinity auto;
events {
    worker_connections 512;
}

http {
    server_tokens off;
    fastcgi_ignore_client_abort off;
    keepalive_timeout 650;
    keepalive_requests 10000;
    gzip on;
    ..more gzip settings..

    server {
        charset utf-8;
        client_max_body_size 512M;
        listen 8080;
        rewrite_log on;
        root /app/web;
        index index.php;

        location / {
            try_files $uri /index.php?$args;
        }

        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            include /etc/nginx/fastcgi_params;
            fastcgi_keep_conn off;
            fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
        }
    }

    include /etc/nginx/conf.d/*.conf; # There are no extra conf files
}
php-fpm (www.conf)
[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = ondemand
pm.process_idle_timeout = 10s
request_terminate_timeout = 45
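For completeness, the unix-socket variant I couldn't get working would, as far as I understand it, normally be wired up like this (the socket path is illustrative):

```nginx
# nginx: point fastcgi_pass at the socket file instead of 127.0.0.1:9000
location ~ \.php$ {
    fastcgi_pass unix:/var/run/php-fpm.sock;
    include /etc/nginx/fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
}
```

```ini
; www.conf: listen on the same socket file, readable by the nginx user
listen = /var/run/php-fpm.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660
```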
More info
Disabling keepalive is a bad idea (I tried it), as App Engine constantly polls the container with health checks, and this creates a lot of dead TIME_WAIT sockets.
Before I set request_terminate_timeout there were a lot of CLOSE_WAIT sockets instead of TIME_WAIT. Setting request_terminate_timeout = 45 does help in a sense, as the stuck worker process gets killed and serves a 200 again after respawning. A lower terminate timeout just generates more 502s and fewer 504s.
The process_idle_timeout is ignored, as the sockets are not technically idle.
Setting fastcgi_keep_conn on has no measurable impact on nginx's behaviour.
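My understanding from the nginx docs is that fastcgi_keep_conn on only actually reuses upstream connections when fastcgi_pass points at an upstream block with a keepalive pool, which may explain the lack of effect. A sketch of that variant (the upstream name php_backend is my own):

```nginx
upstream php_backend {
    server 127.0.0.1:9000;
    keepalive 8;                # idle connections kept open per worker
}

server {
    location ~ \.php$ {
        fastcgi_pass php_backend;
        fastcgi_keep_conn on;   # don't close the FastCGI connection after each request
        include /etc/nginx/fastcgi_params;
    }
}
```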