I admin a busy web server which uses nginx/php-fpm and connects out to MySQL (RDS) and Elasticsearch, but also to many third parties for advertising and other plugins on the site (unfortunately I'm not aware of all the specifics).
I have a random, intermittent issue plaguing me: occasionally php-fpm workers start to pile up, and as a result so do CPU usage, connections to SQL, and eventually SQL CPU. Fortunately this never lasts too long.
I am sure the cause is something remote, as it occurs simultaneously on all servers currently behind the load balancer.
From my investigation and testing I think I have tracked this down to something in the web layer making the PHP processes hang.
I believe I can rule out connections to my ES cluster, and also to RDS, for several reasons:
- Separate monitoring of ES from the specific host having problems shows no issues.
- All connections to ES/SQL are performed through the API layer, and the API logs show no failed requests (499/502) like the ones I am getting in the web logs.
- A health check script that runs in PHP and pulls data from ES and SQL from the web server itself also shows no problems, while at the same time the web layer starts to return 499/502.
- Further general environment monitoring of SQL and ES shows no problems.
It's also not a sudden increase in connections or an attack: looking back over the load balancer metrics shows nothing of concern other than increased latency as the issue starts to take effect.
My suspicion is that part of the PHP request to the web layer requires it to generate a response that includes data from external sources, some of which occasionally fail to respond and make the server response hang.
I need a way to prove (or disprove) this and identify the connections. I have been looking at netstat, and possibly Wireshark, but I could do with some help on determining a command that will highlight outgoing connections that either fail or hang. Just being able to log any outgoing connection that takes longer than a certain time would be very helpful: if the issues coincide with these logs then I will be on the right track, with some clues.
If it's not practical to run it continuously, I know how I can make it kick off as soon as the connections start to time out.
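To make concrete what I'm after, here is a minimal sketch in Python of the kind of logger I have in mind. It is only an idea, not something I have running: it assumes a Linux host where `/proc/net/tcp` is readable, IPv4 only (IPv6 lives in `/proc/net/tcp6`), a little-endian machine such as x86, and that nginx listens on ports 80/443 so those sockets can be skipped as inbound. The function names, the 1 second poll interval and the 5 second threshold are all made up for illustration.

```python
import time

# Kernel TCP state codes as printed (in hex) in /proc/net/tcp.
STATES = {"01": "ESTABLISHED", "02": "SYN_SENT"}


def decode_addr(hex_addr):
    """Turn '0100007F:1F90' into ('127.0.0.1', 8080). IPv4 only."""
    ip_hex, port_hex = hex_addr.split(":")
    # On little-endian hosts the address bytes are printed in reverse order.
    octets = [str(int(ip_hex[i:i + 2], 16)) for i in range(6, -2, -2)]
    return ".".join(octets), int(port_hex, 16)


def parse_proc_net_tcp(text, listen_ports=frozenset({80, 443})):
    """Return {(local, remote): state} for connections opened by this host."""
    conns = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4 or fields[3] not in STATES:
            continue
        local, remote = decode_addr(fields[1]), decode_addr(fields[2])
        # Assumption: local port 80/443 means inbound web traffic; skip loopback too.
        if local[1] in listen_ports or remote[0].startswith("127."):
            continue
        conns[(local, remote)] = STATES[fields[3]]
    return conns


def update(first_seen, conns, now, threshold):
    """Track when each connection was first seen; return those open >= threshold."""
    for key in list(first_seen):
        if key not in conns:
            del first_seen[key]  # connection has gone away since the last poll
    slow = []
    for key, state in conns.items():
        age = now - first_seen.setdefault(key, now)
        if age >= threshold:
            slow.append((key[1], state, age))
    return slow


def watch(interval=1.0, threshold=5.0):
    """Poll /proc/net/tcp forever and print outgoing connections that linger."""
    first_seen = {}
    while True:
        with open("/proc/net/tcp") as f:
            conns = parse_proc_net_tcp(f.read())
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for (ip, port), state, age in update(first_seen, conns, time.monotonic(), threshold):
            print(f"{stamp} {ip}:{port} {state} open for {age:.0f}s")
        time.sleep(interval)
```

Calling `watch()` at the bottom of the script would start the loop. I realise this is crude: ages are only as accurate as the poll interval, connections shorter than one interval are never seen, and legitimate keepalive connections (to the API layer, for example) would be reported on every poll and need filtering out. I would expect entries stuck in `SYN_SENT` to be the clearest sign of a remote host that is not answering. If there is an existing tool or command that does this properly, that would be even better.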
Hopefully you guys can give me some ideas :)
Thanks