I have a multithreaded Linux application.
The application has run successfully on several Linux systems and kernels without ever showing this issue.
We are currently using this kernel:
# uname -a
Linux AM38 4.9.0-8-rt-amd64 #1 SMP PREEMPT RT Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
We have been using this kernel for 1 year now without problems.
The application accepts external client connections. When a client connects, a couple of threads are created for that client.
Recently I've run into a problem where pthread_create returns EAGAIN. I managed to design a stress test that reproduces the failure. It takes 2 hours to reproduce, about the same time it took to fail in production.
Once I was able to reproduce the problem, I went back to versions that had been used in production without problems, but the problem now arises on the old versions too. So I think we always had the issue, but we now have a use case that exposes it.
Basically, the test simulates communication cuts of 30 seconds, so all the clients get disconnected, and then I let the system work normally for another 30 seconds so the clients can reconnect. I added a 450 ms latency to increase the stress during reconnection. There are only 30 clients.
In production and in my stress lab conditions, the problem appears about 2 hours after the stress starts.
I've checked for zombies to be sure that I was joining the threads properly. htop or ps never show any thread as Z or defunct.
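A caveat on this check: on Linux, a thread that has exited without being joined generally does not show up as Z in ps or htop, because its kernel task is already gone and only its stack stays allocated inside the process. A minimal sketch of a more direct check is to count creations against joins inside the application. All names below (tracked_create, tracked_join, threads_outstanding, example_worker) are hypothetical and not part of the application; the sketch assumes every thread is joinable rather than detached.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical counters: threads started and threads reclaimed by a join. */
static atomic_long threads_created;
static atomic_long threads_joined;

/* Wrapper around pthread_create that counts successful creations. */
int tracked_create(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    int rc = pthread_create(tid, NULL, fn, arg);
    if (rc == 0)
        atomic_fetch_add(&threads_created, 1);
    return rc;
}

/* Wrapper around pthread_join that counts successful joins. */
int tracked_join(pthread_t tid)
{
    int rc = pthread_join(tid, NULL);
    if (rc == 0)
        atomic_fetch_add(&threads_joined, 1);
    return rc;
}

/* Threads created but not yet joined. If this keeps growing across
 * disconnect/reconnect cycles, some threads are never being joined. */
long threads_outstanding(void)
{
    return atomic_load(&threads_created) - atomic_load(&threads_joined);
}

/* Trivial placeholder worker, for illustration only. */
void *example_worker(void *arg)
{
    return arg;
}
```

Logging threads_outstanding() once per reconnect cycle would show whether the number stays bounded over the 2 hours.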
I have monitored the system with htop and I never saw more than 46 tasks and 140 threads total in the system.
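The same number can be sampled from inside the process rather than by watching htop. The sketch below counts the entries of /proc/self/task, which holds one directory per live thread of the calling process. The function name count_own_threads is hypothetical.

```c
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/* Count the live threads of the calling process by listing
 * /proc/self/task; each numeric entry is one thread ID.
 * Returns -1 if the directory cannot be opened. */
int count_own_threads(void)
{
    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Skip "." and ".."; thread IDs start with a digit. */
        if (isdigit((unsigned char)entry->d_name[0]))
            count++;
    }
    closedir(dir);
    return count;
}
```

Calling this periodically from the application and logging the result gives a time series that can be lined up with the moment pthread_create starts failing.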
I've checked the system limits and they look OK.
# ulimit -a
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31414
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 31414
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
# cat /proc/sys/kernel/pid_max
327687
# cat /proc/sys/kernel/threads-max
62828
# free -h
total used free shared buff/cache available
Mem: 7.7G 470M 3.7G 83M 3.5G 7.0G
Swap: 0B 0B 0B
The output of ps looks like this:
# ps -axH | grep myapplication
3910 tty5 Sl+ 4:23 myapplication -v
3910 tty5 Sl+ 1:41 myapplication -v
3910 tty5 Sl+ 0:02 myapplication -v
3910 tty5 Sl+ 0:00 myapplication -v
3910 tty5 Sl+ 0:46 myapplication -v
.... same looking lines here
3910 tty5 Sl+ 0:00 myapplication -v
3910 tty5 Sl+ 0:47 myapplication -v
3910 tty5 Sl+ 0:00 myapplication -v
3910 tty5 Sl+ 0:48 myapplication -v
3910 tty5 Sl+ 0:00 myapplication -v
3910 tty5 Sl+ 0:49 myapplication -v
3910 tty5 Sl+ 0:51 myapplication -v
total threads: 134
I can connect to the system and execute programs and the web server on the system runs. Only that process seems to fail.
If I stop/start that process all goes back to normal for another 2h.
At this point pthread_create fails with EAGAIN. I've found that I might be hitting this bug:
https://bugzilla.kernel.org/show_bug.cgi?id=154011
But I don't know how to confirm it or what to do to solve my issue. It does not seem to be fixed.
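One way to gather evidence is to record the process state at the exact moment of failure. The sketch below wraps pthread_create and, on error, logs the error string together with the number of memory mappings read from /proc/self/maps. Whether the mapping count is the resource that runs out here is only an assumption; it is one data point that can be compared against /proc/sys/vm/max_map_count. The names count_memory_mappings and create_thread_logged are hypothetical.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Count the lines in /proc/self/maps, i.e. the number of memory
 * mappings of the calling process. Returns -1 on error. */
long count_memory_mappings(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;

    long lines = 0;
    int c;
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n')
            lines++;
    }
    fclose(f);
    return lines;
}

/* Wrapper around pthread_create that logs diagnostics on failure.
 * pthread_create returns the error number directly (it does not set
 * errno), so the return value is what gets passed to strerror. */
int create_thread_logged(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    int rc = pthread_create(tid, NULL, fn, arg);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %s (rc=%d), mappings=%ld\n",
                strerror(rc), rc, count_memory_mappings());
    }
    return rc;
}
```

If the mapping count at failure is far below the limit, that resource can be ruled out and the kernel bug becomes a more plausible explanation.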
Suggestions?