
I admin a busy web server that runs nginx/php-fpm and connects out to MySQL (RDS) and Elasticsearch; the site also pulls in many third parties for advertising and other plugins (unfortunately I'm not aware of all the specifics).

I have a random, intermittent issue plaguing me: occasionally php-fpm workers start to pile up, and as a result so do CPU usage and connections to SQL, and eventually SQL CPU as well. Fortunately this never lasts too long.

I am sure this is something remote, as it occurs simultaneously on all servers currently under the load balancer.

From my investigation and testing I think I have tracked this down to something in the web layer making the PHP processes hang.

I believe I can rule out connections to my ES cluster, and also to RDS, for several reasons:

- Separate monitoring of ES from the specific host having problems shows no issues.
- All connections to ES/SQL are performed through an API layer, and the API logs show none of the failed requests (499/502) that I am getting in the web logs.
- A health check script running in PHP on the web server itself, which pulls data from ES and SQL, also shows no problems, even while the web layer starts to return 499/502.
- Further general environment monitoring of SQL and ES shows no problems.

It's also not a sudden increase in connections or an attack: looking back over the load balancer metrics shows nothing of concern other than increased latency as the issue starts to take effect.

My suspicion is that part of handling a request in the web layer requires generating a response that includes data from external sources, some of which occasionally fail to respond and make the server response hang.
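Since the symptom is php-fpm workers piling up, I'm also considering enabling php-fpm's slow-request log in parallel with any network tracing: it dumps a PHP backtrace for every request that runs longer than a threshold, which should show directly whether workers are stuck inside a curl/socket call to an external service. A sketch for the pool config (the paths and thresholds below are assumptions, not my actual values):

```ini
; In the pool config (e.g. /etc/php-fpm.d/www.conf; path varies by distro):
; dump a backtrace of any request running longer than the threshold.
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/www-slow.log

; Optionally also kill requests that hang far too long (30s is a guess).
request_terminate_timeout = 30s
```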

I need a way to prove (or disprove) this and identify the connections. I have been looking at netstat, and possibly Wireshark, but I could do with some help determining a command that will highlight outgoing connections that either fail or hang. Just being able to log any outgoing connection that takes over a certain time would be very helpful; if the issues coincide with these logs then I will be on the right track, with some clues.
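Something like the following is the kind of logging I have in mind: polling `ss` for outbound sockets that are stuck mid-handshake or have data sitting unacknowledged in the send queue. This is only a sketch; the poll interval, the example log path, and the awk column positions (assuming `ss -tn state ...` output with columns Recv-Q, Send-Q, Local, Peer) are assumptions.

```shell
#!/bin/bash
# Sketch: flag outbound TCP sockets that look hung.
# Assumes `ss -tn state ...` output columns:
#   Recv-Q  Send-Q  Local-Address:Port  Peer-Address:Port

# Keep only rows with bytes stuck in the send queue (Send-Q > 0),
# i.e. data we sent that the remote end has not acknowledged.
filter_stalled() {
    awk 'NR > 1 && $2 > 0 { print "STALLED", $3, "->", $4, "send-q=" $2 }'
}

# SYN-SENT sockets are connects still waiting for the remote SYN-ACK;
# any long-lived entry here is an external host not answering.
filter_syn_sent() {
    awk 'NR > 1 { print "SYN-SENT", $3, "->", $4 }'
}

log_once() {
    local ts
    ts=$(date '+%F %T')
    { ss -tn state established | filter_stalled
      ss -tn state syn-sent    | filter_syn_sent
    } | sed "s/^/$ts /"
}

# Example poll loop (interval and log path are guesses):
#   while true; do log_once >> /var/log/outbound-hangs.log; sleep 5; done
```

Entries that persist across several polls to the same peer would be the hanging external connections I'm looking for.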

I know how I can make this kick off as soon as the connections start to time out, if it's not practical to run continuously.

Hopefully you guys can give me some ideas :)

Thanks

Anthony

1 Answer


The only way you'll be able to get the required data is by doing a packet capture, with full packet details. Something like:

$ tcpdump -s0 -w packet.cap port 80 or port 443

Warning: this will consume disk space, so ensure you have plenty of storage available for the packet capture. After running this through a period where the issue is observed, copy the file locally and examine it using Wireshark. You'll be able to examine full TCP flows and HTTP calls/responses, both those initiated by clients and those initiated by your server.
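Once you have a capture, one quick way to find hung connects without opening Wireshark is to count repeated outbound SYNs to the same peer: a SYN that gets retransmitted is a connect that received no SYN-ACK. A rough sketch against the text output of `tcpdump -nr` (the capture filename matches the command above; the awk field positions assume standard tcpdump one-line output):

```shell
#!/bin/bash
# Sketch: count pure SYN packets per source -> destination pair from
# tcpdump text output; a pair seen more than once is a retransmitted
# (unanswered) connection attempt. Input lines look like:
#   10:00:01.000 IP 10.0.0.5.43210 > 93.184.216.34.443: Flags [S], seq ...
# Note: "Flags [S.]," (SYN-ACK) deliberately does not match the pattern.

count_syn_retries() {
    awk '/Flags \[S\],/ { n[$3 " > " $5]++ }
         END { for (k in n) if (n[k] > 1) print k, "syn-count=" n[k] }'
}

# Usage (pull only SYNs out of the capture, then count retries):
#   tcpdump -nr packet.cap 'tcp[tcpflags] == tcp-syn' | count_syn_retries
```

Any pair that shows up here is an outbound connect your server attempted that the remote host never answered, which is exactly the hang signature you're hunting.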

I would ask, though... are you certain that your server is actually requesting these external resources and then serving them to its clients? In the vast majority of cases, ad networks and the like serve directly to clients' browsers and not through your web server.

EEAA
  • Thanks for the advice; I guess that is what I'm reduced to. I'm not entirely sure that these resources are being pulled via the server (in fact I know a lot are not), but something is making the process hang up occasionally and randomly, and I'm trying to find any evidence to identify the cause. With the lack of errors, unusual metrics, or issues with connected services, it's one more thing I'd like to rule out. The site also connects to SugarCRM on the backend; my initial suspicion was this, but I have not seen any correlating issues there – Anthony Feb 13 '17 at 09:26