Apache2 stuck after 8-12 days uptime. No errors. Stuck in null loop and Recv-Q

Question

I have some fresh web servers on Ubuntu 22 LTS running apache2 with mpm_event and php-fpm.

All of them behave the same way: after 8-12 days of Apache uptime, it suddenly stops receiving requests until I restart apache2 manually. Then it runs fine again for about another 10 days.

It gets stuck at random times.

Here is the output from netstat. Apache seems to listen only on tcp6, which looks wrong. Should it not also listen on plain tcp (IPv4)?

netstat -tulpn

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      729/systemd-resolve 
tcp        0      0 0.0.0.0:22825           0.0.0.0:*               LISTEN      2728/php            
tcp        0      0 0.0.0.0:23004           0.0.0.0:*               LISTEN      2752/php            
tcp        0      0 0.0.0.0:22928           0.0.0.0:*               LISTEN      2742/php            
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1072/sshd: /usr/sbi 
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      971/mysqld          
tcp        0      0 127.0.0.1:33060         0.0.0.0:*               LISTEN      971/mysqld          
tcp6       0      0 :::9100                 :::*                    LISTEN      1074312/node_export 
tcp6      71      0 :::80                   :::*                    LISTEN      2207/apache2        
tcp6       0      0 :::22                   :::*                    LISTEN      1072/sshd: /usr/sbi 
tcp6     512      0 :::443                  :::*                    LISTEN      2207/apache2        
udp        0      0 127.0.0.53:53           0.0.0.0:*                           729/systemd-resolve 
udp        0      0 0.0.0.0:500             0.0.0.0:*                           1073/charon         
udp        0      0 0.0.0.0:4500            0.0.0.0:*                           1073/charon         
udp6       0      0 :::500                  :::*                                1073/charon         
udp6       0      0 :::4500                 :::*                                1073/charon

I also ran strace on the Apache parent process.

strace -o apache.strace -f -p 2207

output:

2207  times({tms_utime=3026 /* 30.26 s */, tms_stime=6010 /* 60.10 s */, tms_cutime=369020 /* 3690.20 s */, tms_cstime=174724 /* 1747.24 s */}) = 1819843146
2207  pselect6(0, NULL, NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
2207  wait4(-1, 0x7ffedc77b084, WNOHANG|WSTOPPED, NULL) = 0
2207  times({tms_utime=3026 /* 30.26 s */, tms_stime=6010 /* 60.10 s */, tms_cutime=369020 /* 3690.20 s */, tms_cstime=174724 /* 1747.24 s */}) = 1819843246
2207  pselect6(0, NULL, NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
2207  wait4(-1, 0x7ffedc77b084, WNOHANG|WSTOPPED, NULL) = 0
2207  times({tms_utime=3026 /* 30.26 s */, tms_stime=6010 /* 60.10 s */, tms_cutime=369020 /* 3690.20 s */, tms_cstime=174724 /* 1747.24 s */}) = 1819843346
2207  pselect6(0, NULL, NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
2207  wait4(-1, 0x7ffedc77b084, WNOHANG|WSTOPPED, NULL) = 0
2207  times({tms_utime=3026 /* 30.26 s */, tms_stime=6010 /* 60.10 s */, tms_cutime=369020 /* 3690.20 s */, tms_cstime=174724 /* 1747.24 s */}) = 1819843446
2207  pselect6(0, NULL, NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
2207  wait4(-1, 0x7ffedc77b084, WNOHANG|WSTOPPED, NULL) = 0
2207  times({tms_utime=3026 /* 30.26 s */, tms_stime=6010 /* 60.10 s */, tms_cutime=369020 /* 3690.20 s */, tms_cstime=174724 /* 1747.24 s */}) = 1819843546
2207  pselect6(0, NULL, NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
2207  wait4(-1, 0x7ffedc77b084, WNOHANG|WSTOPPED, NULL) = 0
2207  times({tms_utime=3026 /* 30.26 s */, tms_stime=6010 /* 60.10 s */, tms_cutime=369020 /* 3690.20 s */, tms_cstime=174724 /* 1747.24 s */}) = 1819843646
2207  pselect6(0, NULL, NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
2207  wait4(-1, 0x7ffedc77b084, WNOHANG|WSTOPPED, NULL) = 0
2207  times({tms_utime=3026 /* 30.26 s */, tms_stime=6010 /* 60.10 s */, tms_cutime=369020 /* 3690.20 s */, tms_cstime=174724 /* 1747.24 s */}) = 1819843747
2207  pselect6(0, NULL, NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
2207  wait4(-1, 0x7ffedc77b084, WNOHANG|WSTOPPED, NULL) = 0
2207  times({tms_utime=3026 /* 30.26 s */, tms_stime=6010 /* 60.10 s */, tms_cutime=369020 /* 3690.20 s */, tms_cstime=174724 /* 1747.24 s */}) = 1819843847
2207  pselect6(0, NULL, NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
2207  wait4(-1, 0x7ffedc77b084, WNOHANG|WSTOPPED, NULL) = 0
2207  times({tms_utime=3026 /* 30.26 s */, tms_stime=6010 /* 60.10 s */, tms_cutime=369020 /* 3690.20 s */, tms_cstime=174724 /* 1747.24 s */}) = 1819843947
2207  pselect6(0, NULL, NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
2207  wait4(-1, 0x7ffedc77b084, WNOHANG|WSTOPPED, NULL) = 0

It just loops like that forever.

Here is the iptables setup:

-P INPUT DROP
-P FORWARD DROP
-P OUTPUT ACCEPT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --dport 80 -m conntrack --ctstate NEW -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --dport 443 -m conntrack --ctstate NEW -j ACCEPT
-A INPUT -i eth0 -p tcp -m multiport --dports 22345:25000 -m conntrack --ctstate NEW -j ACCEPT

ip6tables:

-P INPUT DROP
-P FORWARD DROP
-P OUTPUT ACCEPT
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p ipv6-icmp -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --dport 80 -m conntrack --ctstate NEW -j ACCEPT
-A INPUT -i eth0 -p tcp -m tcp --dport 443 -m conntrack --ctstate NEW -j ACCEPT
-A INPUT -i eth0 -p tcp -m multiport --dports 22345:25000 -m conntrack --ctstate NEW -j ACCEPT

Hello @ezra-s, would you elaborate on your question? I highly doubt the issue is caused by iptables. It seems a lot of connections get stuck under one Apache process, which eventually eats up all connections. – Aidvi, Oct 05 '22 at 14:30
In a question that is supposedly about httpd you bring up your iptables setup, with no info from mod_status and no httpd error logs, only a trace showing timeouts. That's why I told you to try disabling iptables, to check it and rule it out as the problem, because that is what you are bringing up. If you are sure it is not iptables, why show it? Why not show anything from the httpd side? Also, are you sure it is not the php-fpm side either? That is the side that would usually keep Apache workers waiting for a response if PHP scripts are too slow or unresponsive. – Daniel Ferradal, Oct 06 '22 at 15:21
I included it because I suspected the problem had something to do with Apache only listening on tcp6, so it could have been an iptables fault. Another answer stated that tcp6 means both IPv4 and IPv6, so that settles it. As I said in the post, there are no errors and nothing in the logs looks wrong. Could you elaborate on mod_status? The only things I found came from netstat and strace on the single process active at the time. This problem builds up over multiple days. I checked the php-fpm logs and restarted php-fpm while the problem was active. No luck. If you have more debugging I can do, please share. – Aidvi, Oct 14 '22 at 13:34
In mod_status you can see what the workers that get stuck are doing. Load mod_status, activate it through the configuration in your virtualhost, and check. – Daniel Ferradal, Oct 15 '22 at 15:31
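
As a rough illustration of the mod_status suggestion: once the module is enabled, its machine-readable page (the `?auto` variant of the status URL) reports `BusyWorkers`, `IdleWorkers` and a `Scoreboard` string with one character per worker slot. The sketch below, in Python, only parses such text and tallies the scoreboard. The sample input is made up for the example and does not come from the servers in the question, and the helper names are hypothetical.

```python
from collections import Counter

# Meaning of the scoreboard characters as listed on the mod_status page.
SCOREBOARD_KEYS = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def parse_server_status(auto_text):
    """Turn the 'Key: value' lines of a ?auto status page into a dict."""
    fields = {}
    for line in auto_text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not 'Key: value'
            fields[key.strip()] = value.strip()
    return fields


def summarize_scoreboard(fields):
    """Count how many worker slots are in each scoreboard state."""
    counts = Counter(fields.get("Scoreboard", ""))
    return {SCOREBOARD_KEYS.get(ch, ch): n for ch, n in counts.items()}


# Hypothetical sample, NOT taken from the servers in the question.
sample = """BusyWorkers: 6
IdleWorkers: 2
Scoreboard: WWWWWW__...."""

fields = parse_server_status(sample)
print(fields["BusyWorkers"], fields["IdleWorkers"])
print(summarize_scoreboard(fields))
```

If nearly every slot sits in one state (for example "sending reply") while few are idle, that points at workers blocked on whatever serves those requests. The text to parse could be fetched with something like `curl "http://localhost/server-status?auto"`, assuming the status handler is mapped to `/server-status` and reachable from localhost.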

drookie · Answer 1 · 2022-11-18T13:30:39.073

1

This is a classic symptom of worker exhaustion in a PHP stack. It usually means that Apache workers running PHP via a module are blocked waiting for something. The most common cause, though not the only one, is a curl request somewhere in the code to an external web server without the timeout explicitly set to a low, reasonable value (say, 10 seconds). As soon as all the workers are stuck in a waiting state, Apache hits its limit for forking new workers and stops handling new connections.

This situation is not specific to Apache running PHP as a module; it can happen in various stacks, including, for example, nginx + php-fpm.
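
To make the "explicit timeout" point concrete, here is a minimal sketch in Python rather than PHP (in PHP the corresponding curl options are `CURLOPT_TIMEOUT` and `CURLOPT_CONNECTTIMEOUT`). The function name and the 10-second default are assumptions for illustration only. The idea is the same in any language: an outbound call that cannot block a worker forever.

```python
import urllib.request


def fetch_with_timeout(url, timeout_s=10.0):
    """Fetch a URL, giving up instead of blocking indefinitely.

    Returns the response body as bytes, or None if the call failed or
    timed out. Note: urlopen's timeout applies to each blocking socket
    operation (connect, read), not to the request as a whole.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
```

Without the `timeout` argument the call would wait for as long as the remote side stays silent, which is exactly how a worker ends up parked for days.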

P.S. When an application listens on a single socket that covers both the IPv4 and IPv6 address families, Linux shows it as listening on tcp6.
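
The non-zero Recv-Q values on the two apache2 listeners in the question's netstat output fit this picture: on Linux, for a socket in the LISTEN state, Recv-Q counts connections that are queued and have not yet been accepted by the application. A small Python sketch that picks such rows out of `netstat -tulpn` output could look like the following. The function name is made up for the example, and the sample rows are copied from the question.

```python
def queued_on_listeners(netstat_output):
    """Return (local_address, recv_q) for LISTEN sockets with a backlog."""
    results = []
    for line in netstat_output.splitlines():
        # Columns: Proto Recv-Q Send-Q Local Foreign State PID/Program
        fields = line.split(None, 6)
        if len(fields) < 6 or fields[5] != "LISTEN":
            continue  # header lines, UDP rows, anything not listening
        recv_q = int(fields[1])
        if recv_q > 0:
            results.append((fields[3], recv_q))
    return results


# Three rows copied from the netstat output in the question.
sample = """\
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      971/mysqld
tcp6      71      0 :::80                   :::*                    LISTEN      2207/apache2
tcp6     512      0 :::443                  :::*                    LISTEN      2207/apache2
"""

for address, queued in queued_on_listeners(sample):
    print(address, queued)
```

A listener whose queue keeps growing while the process is alive suggests that nothing is calling accept() on it any more, which is what one would expect if every worker is blocked.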

edited Nov 18 '22 at 13:30

answered Sep 29 '22 at 11:35


Thank you for an informative answer. My question: if we are talking about worker exhaustion, wouldn't it show up in a monitor like "top", or take up a lot of CPU or RAM? Right now I have a test server in this stuck state. I can post more information if needed. – Aidvi Sep 29 '22 at 11:44
Not necessarily: usually workers just block on some external call (imagine a curl call to an external service that is not answering, with no timeout set), so they don't consume CPU; they only wait for the answer. Whether worker exhaustion happened to you can easily be determined by counting all the Apache children and comparing the total to the limits in its config file. – drookie Nov 18 '22 at 13:33
Thank you. What I did to temporarily fix the issue was to restart Apache every night at 2 AM. The problem has not happened since. I still have a test server I can spin up, and it should run out of connections after 2 weeks. I should be able to see whether the number of Apache children matches the limits in the config file. – Aidvi Nov 18 '22 at 14:06
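
The comparison suggested in the comments (number of Apache children versus the configured limits) can be sketched in Python as below. This assumes the event MPM from the question, where `MaxRequestWorkers` counts threads and each child process runs `ThreadsPerChild` of them. The helper names and the sample values are illustrative and are not the asker's actual configuration.

```python
import re


def parse_mpm_limits(conf_text):
    """Collect 'Directive <integer>' lines from an MPM config snippet."""
    pattern = r"^[ \t]*(\w+)[ \t]+(\d+)[ \t]*$"
    return {name: int(value)
            for name, value in re.findall(pattern, conf_text, re.MULTILINE)}


def max_child_processes(limits):
    """With a threaded MPM, the process cap follows from the thread limits."""
    return limits["MaxRequestWorkers"] // limits["ThreadsPerChild"]


def at_process_limit(child_count, limits):
    """True when no further child process can be started."""
    return child_count >= max_child_processes(limits)


# Illustrative values only; read the real ones from the server's MPM config.
sample_conf = """
<IfModule mpm_event_module>
    StartServers             2
    ThreadsPerChild         25
    MaxRequestWorkers      150
</IfModule>
"""

limits = parse_mpm_limits(sample_conf)
print(max_child_processes(limits))
print(at_process_limit(6, limits))
```

The child count could come from something like `pgrep -c -P 2207`, where 2207 is the parent PID shown in the question, and on Ubuntu the limits usually live in `/etc/apache2/mods-enabled/mpm_event.conf` (an assumption about the layout, not something stated in the thread). Being at the process limit does not by itself prove that every thread is busy, so it is best read together with the mod_status scoreboard.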
