
I'm serving WordPress pages via nginx and PHP5-FPM, with APC caching (through the W3 Total Cache plugin). Nginx communicates with PHP-FPM over TCP sockets on port 9000. I've raised the maximum number of connections via sysctl to 1024, and I've set both max_execution_time (in php.ini) and request_terminate_timeout (in the FPM pool config) to 30 seconds.
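
Roughly, the relevant settings look like this (the paths and pool name are from memory, so treat them as approximate):

    ; /etc/php5/fpm/php.ini
    max_execution_time = 30

    ; /etc/php5/fpm/pool.d/www.conf
    listen = 127.0.0.1:9000
    request_terminate_timeout = 30s

    # nginx server block
    location ~ \.php$ {
        fastcgi_pass 127.0.0.1:9000;
        include fastcgi_params;
    }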

Every now and then (say every 8-10 hours, and not at regular intervals) the number of open TCP connections on port 9000 grows to nearly 1000 (mostly in CLOSE_WAIT, with some FIN_WAIT and FIN_WAIT_2), sometimes surpassing 1000, and the web server starts returning 504 errors. Once I kill all TCP connections on that port and restart FPM, it works fine again.
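
When it happens I tally the connection states with a one-liner like this:

    # count TCP states for connections involving port 9000
    netstat -tan | grep ':9000' | awk '{print $6}' | sort | uniq -c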

I enabled the FPM slow log to see what's going on, and if I'm reading it right, requests are hanging on apc_store() calls.
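
The slow log is turned on in the pool config roughly like so (the log path and threshold are just what I happened to pick):

    ; /etc/php5/fpm/pool.d/www.conf
    slowlog = /var/log/php5-fpm.slow.log
    request_slowlog_timeout = 10s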

Is this an APC misconfiguration, or do I need to tweak the FPM settings? And is there a way to force those TCP connections to terminate even if the script side never closes the connection (i.e. never sends its FIN)?

Example trace from FPM slow log:

[22-Jan-2015 09:42:49]  [pool www] pid 20327
script_filename = /var/www/index.php
[0x00007fdc527ec908] apc_store() /var/www/wp-content/plugins/w3-total-cache/lib/W3/Cache/Apc.php:55
[0x00007fdc527ec768] set() /var/www/wp-content/plugins/w3-total-cache/lib/W3/ObjectCache.php:254
[0x00007fdc527ec5e0] set() /var/www/wp-content/plugins/w3-total-cache/lib/W3/ObjectCache.php:300
[0x00007fdc527ec488] add() /var/www/wp-content/plugins/w3-total-cache/lib/W3/ObjectCacheBridge.php:73
[0x00007fdc527ec330] add() /var/www/wp-content/object-cache.php:94
[0x00007fdc527ec200] wp_cache_add() /var/www/wp-includes/option.php:176
[0x00007fdc527ec078] wp_load_alloptions() /var/www/wp-includes/functions.php:1272
[0x00007fdc527ebf40] is_blog_installed() /var/www/wp-includes/load.php:474
[0x00007fdc527ebdb0] wp_not_installed() /var/www/wp-settings.php:109
[0x00007fdc527ebc88] +++ dump failed
Ansari
  • Have you tried using Unix domain sockets instead of TCP to communicate from Nginx to PHP-FPM? I don't know why you are having this problem with TCP connections, but switching away from TCP may be the easiest fix (see the wiring sketch after these comments). – Moshe Katz Jan 22 '15 at 18:41
  • @MosheKatz I was using Unix sockets initially, but I think a similar situation happened and I moved to TCP sockets because I read that was a more scalable solution. – Ansari Jan 22 '15 at 18:50
  • TCP sockets are only more scalable if you actually use multiple worker machines and farm the connections out to them. **If you are only using a single machine**, then the extra overhead of encapsulating the data in TCP packets in FPM and de-encapsulating it again in Nginx means that domain sockets are more efficient. – Moshe Katz Jan 22 '15 at 18:52
  • @MosheKatz Thanks for clearing that up. I can go back to Unix sockets, but do you think that will address the underlying problem? i.e. the client is not closing connections in time sometimes and the number of open connections piles up ... – Ansari Jan 22 '15 at 19:06
  • I don't know for sure whether it will clear the problem up or not. (That's why this is a comment, not an answer.) It certainly can't hurt to try though. – Moshe Katz Jan 22 '15 at 19:59
  • @MosheKatz It's happening even with Unix sockets :( – Ansari Jan 23 '15 at 00:00
  • I have this problem too. – Philip Mar 02 '15 at 08:36
  • @Philip I ended up going back to TCP sockets and lowering the memory footprint and number of child processes a little. The server is much more stable now, but I still don't think the problem is fully solved. – Ansari Mar 02 '15 at 10:17
  • @Ansari - Thx, I'm testing this now: http://www.lognormal.com/blog/2012/09/27/linux-tcpip-tuning/ Will let you know if this did anything... – Philip Mar 02 '15 at 16:03
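
A minimal sketch of the Unix-socket wiring discussed in these comments (the socket path, ownership, and mode are assumptions for a Debian-style setup):

    ; FPM pool config
    listen = /var/run/php5-fpm.sock
    listen.owner = www-data
    listen.group = www-data
    listen.mode = 0660

    # nginx
    location ~ \.php$ {
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        include fastcgi_params;
    }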

1 Answer


It looks like you have a lot of TCP connections that are not terminating correctly. A connection in CLOSE_WAIT means this end has received a FIN from the peer and the kernel is now waiting for the local application (whichever process owns the socket) to close() it; there is no timeout for that state, so those connections pile up until the owning process closes them or is restarted. The FIN_WAIT states mean this end has sent a FIN and is waiting for the peer to acknowledge it (FIN_WAIT_1) or to send its own FIN back (FIN_WAIT_2) before the connection can finish closing.
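
As for forcing them closed: the kernel can age out orphaned FIN_WAIT_2 sockets, but there is no kernel timeout for CLOSE_WAIT; those only go away when the owning process closes the socket, which is why restarting FPM clears them. Roughly (the value below is an example, not a recommendation):

    # shorten how long orphaned FIN_WAIT_2 sockets linger (default is 60 seconds)
    sysctl -w net.ipv4.tcp_fin_timeout=30

    # CLOSE_WAIT cannot be expired by the kernel; restarting the owning
    # process is the only way to clear those sockets
    service php5-fpm restart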

TheCompWiz
  • Right, I'm trying to narrow down why that's happening :) Any ideas on how to drill down further? I think it's the apc_store() call taking too long and causing the client not to terminate in time. – Ansari Jan 22 '15 at 18:14