I have been tracking down the source of Apache processes hanging indefinitely. Unfortunately, I have to restart Apache regularly because it eventually exhausts all of its slots. The Apache status page below shows processes hung in the "W - Sending Reply" state. They never die; they just build up until ServerLimit is reached.

[Screenshot: Apache status page]

I ran strace -ff -p {pid} on an Apache process until it eventually hung in the "W - Sending Reply" state, and below is the strace output (I've removed non-pertinent strace output):

connect(13, {sa_family=AF_INET, sin_port=htons(11211), sin_addr=inet_addr("XXX")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=13, events=POLLOUT}], 1, 10)  = 1 ([{fd=13, revents=POLLOUT}])
sendto(13, "get mcalls_e7e0891d35db253e26a31"..., 45, MSG_NOSIGNAL, NULL, 0) = 45
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "END\r\n", 8196, MSG_NOSIGNAL, NULL, NULL) = 5
sendto(13, "get trc-mods.197194\r\n", 21, MSG_NOSIGNAL, NULL, 0) = 21
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "END\r\n", 8196, MSG_NOSIGNAL, NULL, NULL) = 5
sendto(13, "set trc-mods.197194 5 1659718696"..., 2799, MSG_NOSIGNAL, NULL, 0) = 2799
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "STORED\r\n", 8196, MSG_NOSIGNAL, NULL, NULL) = 8
sendto(13, "get uonline_1354585\r\n", 21, MSG_NOSIGNAL, NULL, 0) = 21
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "END\r\n", 8196, MSG_NOSIGNAL, NULL, NULL) = 5
sendto(13, "get grps_1354585\r\n", 18, MSG_NOSIGNAL, NULL, 0) = 18
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "VALUE grps_1354585 5 6\r\n\0\0\0\2\24\0\r\n"..., 8196, MSG_NOSIGNAL, NULL, NULL) = 37
sendto(13, "get trc-mods.197129\r\n", 21, MSG_NOSIGNAL, NULL, 0) = 21
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "END\r\n", 8196, MSG_NOSIGNAL, NULL, NULL) = 5
sendto(13, "set trc-mods.197129 5 1659718696"..., 2308, MSG_NOSIGNAL, NULL, 0) = 2308
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "STORED\r\n", 8196, MSG_NOSIGNAL, NULL, NULL) = 8
sendto(13, "get trc-modcats.8279\r\n", 22, MSG_NOSIGNAL, NULL, 0) = 22
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(13, "set trc-modcats.8279 5 165971869"..., 318, MSG_NOSIGNAL, NULL, 0) = 318
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(13, "get trc-mods.211005\r\n", 21, MSG_NOSIGNAL, NULL, 0) = 21
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "VALUE trc-modcats.8279 5 277\r\n\0\0"..., 8196, MSG_NOSIGNAL, NULL, NULL) = 314
sendto(13, "set trc-mods.211005 5 1659718696"..., 2149, MSG_NOSIGNAL, NULL, 0) = 2149
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 1 ([{fd=13, revents=POLLIN}])
recvfrom(13, "STORED\r\nEND\r\nSTORED\r\n", 8196, MSG_NOSIGNAL, NULL, NULL) = 21
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)
getsockopt(13, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(13, 0x556c1ea68810, 8196, MSG_NOSIGNAL, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout)

Based on this output, I believe the issue relates to memcached.

/var/log/messages shows these errors accumulating around the time the hung processes build up:

Aug  5 15:27:43 myhost memcached[3752802]: Failed to write, and not due to blocking: Broken pipe
Aug  5 15:27:43 myhost memcached[3752802]: Failed to write, and not due to blocking: Broken pipe
Aug  5 15:28:02 myhost memcached[3752802]: Failed to write, and not due to blocking: Broken pipe

Another thing I've noticed is that when these hung processes build up, they build up at the same time on all 3 Apache servers (see below). The cliffs represent Apache restarts, and the hung processes are just beginning to build up again at the time of the screenshot.

[Screenshot: graph of hung processes on all 3 Apache servers]

Has anyone encountered this, or does anyone have suggestions on what I can do?

Specs:

  • AlmaLinux 8.5
  • Apache 2.4.37
  • PHP 8.1.4 with PECL Memcached 3.2.0 and libmemcached-awesome 1.1.1
  • memcached 1.6.15 (daemon)
tom_nb_ny
  • Although this does not answer your question, you should try php-fpm instead of running PHP as an Apache/httpd module. – Orphans Aug 08 '22 at 12:03
  • Thanks for the comment. I am considering this; however, I have noticed PHP background tasks launched via crontab also occasionally hanging indefinitely for the same reason, so I don't think there is a guarantee that switching to php-fpm will eliminate the problem. In a way, it may also be easier to restart httpd via a cron job every 2 hours than to try to kill hung php-fpm processes. – tom_nb_ny Aug 08 '22 at 12:31

1 Answer

After a week of trial and error, I think I have finally resolved the issue. I say this cautiously, because I've tried so many things that didn't work and I am still monitoring the Apache status page for hung processes.

My solution doesn't address the underlying problem, which I believe may be a bug in libmemcached or php-pecl-memcached.

My solution was to buffer memcached set() and delete() calls within my application and run them all at once at script shutdown using register_shutdown_function(), rather than scattering them throughout various classes and running them in real time alongside get() calls.
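To make the idea concrete, here is a minimal sketch of that buffering approach. It is not my production code: the class name DeferredCacheWrites and its structure are made up for illustration, and it assumes PHP 8.0+ and a client object that exposes set($key, $value, $ttl) and delete($key) the way \Memcached does.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (illustration only): queues cache writes during the
 * request and replays them in one batch when the script shuts down.
 */
final class DeferredCacheWrites
{
    /** @var array<array{op: string, value: mixed, ttl: int}> Pending writes, keyed by cache key. */
    private array $pending = [];

    /** @param object $client Assumed to expose set($key, $value, $ttl) and delete($key), e.g. \Memcached. */
    public function __construct(private object $client)
    {
        // Replay the queued writes once, after the script's get() calls are done.
        register_shutdown_function([$this, 'flush']);
    }

    public function set(string $key, mixed $value, int $ttl = 0): void
    {
        // Keyed by cache key, so the last operation queued for a key wins.
        $this->pending[$key] = ['op' => 'set', 'value' => $value, 'ttl' => $ttl];
    }

    public function delete(string $key): void
    {
        $this->pending[$key] = ['op' => 'delete', 'value' => null, 'ttl' => 0];
    }

    public function flush(): void
    {
        foreach ($this->pending as $key => $item) {
            // PHP turns numeric-string array keys into ints, so cast back.
            $key = (string) $key;
            if ($item['op'] === 'set') {
                $this->client->set($key, $item['value'], $item['ttl']);
            } else {
                $this->client->delete($key);
            }
        }
        $this->pending = [];
    }
}
```

Usage would look something like `$writes = new DeferredCacheWrites($memcached);` followed by `$writes->set('some-key', $data, 300);` wherever the code previously called `$memcached->set()` directly. One consequence of this design: a get() later in the same request will not see a value that is still sitting in the buffer.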

There are a few gotchas to this strategy (race conditions, etc.), so if you try it, test your code carefully. I should add that this approach is a good opportunity to use setMulti(); however, this method was apparently never properly implemented in php-pecl-memcached (it just runs a loop of set calls).

This discussion was really helpful to understand what might have been happening in my case.

tom_nb_ny