Lately we've been running into issues with our php-fpm processes spinning out of control and making the site unresponsive. There's some obvious php-fpm configuration tuning that needs to be done, but I'd also like to implement a reasonable livenessProbe health check for the php-fpm container that restarts the container when the probe fails.
I've dug up several resources on how to ping the server as a health check (e.g. https://easyengine.io/tutorials/php/fpm-status-page/), but I have yet to find a good answer on what to be on the lookout for. Will the /ping route return something other than 'pong' if the server is effectively dead? Will it just time out? Assuming the latter, what is a reasonable timeout limit?
Running some tests of my own, I notice that a healthy php-fpm server will return the 'pong' response quickly:
# time curl localhost/ping
pong
real 0m0.040s
user 0m0.006s
sys 0m0.001s
I simulated heavy load and indeed it took 1-3 seconds for the 'pong' response, and that coincided with the site becoming unresponsive. Based on that I drew up a draft of a livenessProbe that will fail and restart the container if the liveness probe script takes longer than 2 seconds on 2 consecutive probes:
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - timeout 2 /var/www/livenessprobe.sh
  initialDelaySeconds: 15
  periodSeconds: 3
  successThreshold: 1
  failureThreshold: 2
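As a rough sanity check on how aggressive these numbers are, here is a small sketch of the arithmetic. It assumes the kubelet starts probes on a fixed schedule every periodSeconds, that a hung probe counts as failed as soon as the 2-second timeout expires, and it ignores the time needed to actually stop and start the container. The variable names are just for illustration.

```shell
#!/bin/bash
# Values copied from the livenessProbe draft.
period_seconds=3
failure_threshold=2
probe_timeout=2   # from `timeout 2` in the exec command

# Best case: php-fpm hangs right as a probe starts. The last of the
# consecutive failing probes starts (failure_threshold - 1) periods later
# and is counted as failed once its timeout expires.
best_case=$(( (failure_threshold - 1) * period_seconds + probe_timeout ))

# Worst case: php-fpm hangs just after a probe succeeded, so almost a full
# period passes before the first failing probe even starts.
worst_case=$(( best_case + period_seconds ))

echo "restart triggered ${best_case}-${worst_case}s after php-fpm stops answering"
```

Under those assumptions a restart would be triggered roughly 5 to 8 seconds after php-fpm stops answering within the limit.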
And the probe script is simply this (there are reasons, which I won't get into, why this needs to be a shell script and not a direct httpGet from the livenessProbe):
#!/bin/bash
curl -s localhost/ping
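A stricter variant I'm considering is sketched below. It checks that the body really is 'pong' instead of relying only on the exit code of curl, treats HTTP error responses as failures, and lets curl enforce its own time limit. The function names `is_pong` and `probe_ping` and the URL argument are my own placeholders, and the exact-match check assumes the ping response is the 'pong' seen in my tests above.

```shell
#!/bin/bash
# Hypothetical hardened variant of /var/www/livenessprobe.sh.

# Succeeds only when the response body is exactly "pong".
is_pong() {
  [ "$1" = "pong" ]
}

# Fetches the ping URL and checks the body.
# --fail makes curl exit non-zero on HTTP errors (status 400 and above),
# --max-time aborts the whole request after 2 seconds.
probe_ping() {
  local url="$1" body
  body="$(curl --silent --fail --max-time 2 "$url")" || return 1
  is_pong "$body"
}

# Run the probe only when a URL is passed on the command line, e.g.
#   /var/www/livenessprobe.sh http://localhost/ping
if [ "$#" -gt 0 ]; then
  probe_ping "$1"
fi
```

With this version the exec command would pass the URL as an argument, and the exit status of the script becomes the probe result. Since curl already enforces the 2-second limit here, the outer `timeout 2` wrapper would probably need to be a little longer, or could be dropped.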
Now I don't know whether I'm being too aggressive or too conservative. I'll be running a canary deploy to test this, but in the meantime I'd like some feedback from others who have implemented health checks on php-fpm servers. Bonus points if it's in a Kubernetes context.