F5 LTM frequently kills processes with SIGKILL

Question

We have a BIG-IP 6400 LTM device that is killing processes with alarming frequency. CPU utilization is consistently around 23%, so load does not appear to be the cause.

Here is a sample from /var/log/ltm:

Oct  7 08:21:55 local/pri-4600 info bigd[3471]: reap_child: child process PID = 25338 exited with signal = 9
Oct  7 08:22:15 local/pri-4600 info bigd[3471]: reap_child: child process PID = 25587 exited with signal = 9
Oct  7 08:22:34 local/pri-4600 info bigd[3471]: reap_child: child process PID = 25793 exited with signal = 9
Oct  7 08:23:10 local/pri-4600 info bigd[3471]: reap_child: child process PID = 26260 exited with signal = 9
Oct  7 08:23:36 local/pri-4600 info bigd[3471]: reap_child: child process PID = 26584 exited with signal = 9
Oct  7 08:23:40 local/pri-4600 info bigd[3471]: reap_child: child process PID = 26647 exited with signal = 9
Oct  7 08:23:45 local/pri-4600 info bigd[3471]: reap_child: child process PID = 26699 exited with signal = 9
Oct  7 08:23:55 local/pri-4600 info bigd[3471]: reap_child: child process PID = 26805 exited with signal = 9
Oct  7 08:25:36 local/pri-4600 info bigd[3471]: reap_child: child process PID = 28079 exited with signal = 9
Oct  7 08:27:15 local/pri-4600 info bigd[3471]: reap_child: child process PID = 29286 exited with signal = 9
Oct  7 08:27:16 local/pri-4600 info bigd[3471]: reap_child: child process PID = 29307 exited with signal = 9
Oct  7 08:27:56 local/pri-4600 info bigd[3471]: reap_child: child process PID = 29793 exited with signal = 9
Oct  7 08:29:20 local/pri-4600 info bigd[3471]: reap_child: child process PID = 30851 exited with signal = 9
Oct  7 08:33:00 local/pri-4600 info bigd[3471]: reap_child: child process PID = 1122 exited with signal = 9
Oct  7 08:33:16 local/pri-4600 info bigd[3471]: reap_child: child process PID = 1299 exited with signal = 9
Oct  7 08:34:15 local/pri-4600 info bigd[3471]: reap_child: child process PID = 2054 exited with signal = 9
Oct  7 08:35:16 local/pri-4600 info bigd[3471]: reap_child: child process PID = 2784 exited with signal = 9
Oct  7 08:35:16 local/pri-4600 info bigd[3471]: reap_child: child process PID = 2807 exited with signal = 9
Oct  7 08:35:35 local/pri-4600 info bigd[3471]: reap_child: child process PID = 3015 exited with signal = 9
Oct  7 08:36:15 local/pri-4600 info bigd[3471]: reap_child: child process PID = 3601 exited with signal = 9

Is this normal? If not, what could be causing this to happen?

What version of BIG-IP software are you running? Hate to say it, but it may be worth an upgrade depending on what you're on. We are running 11.5.1 HF7 and it is very stable, though we are looking to upgrade to 11.6.x/12.x.x soon for additional bugfixes and features. - Keegan Jacobson, Oct 19 '15 at 19:55

score 1 · Answer 1 · answered Oct 19 '15 at 19:31

bigd is the monitoring daemon on the BIG-IP, so it appears that a monitor in use is crashing. You should open a case with support and upload your qkview to ihealth.f5.com. Here is a solution related to that error message:

https://support.f5.com/kb/en-us/solutions/public/17000/000/sol17092.html
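
Before opening a case, it may help to quantify how often the children are being killed. The following is a minimal sketch, not an F5 tool: it assumes the log lines look exactly like the sample in the question, and the names `LINE_RE` and `count_signal_exits` are made up for illustration.

```python
import re
from collections import Counter

# Assumed format, taken from the sample in the question:
# "Oct  7 08:21:55 local/pri-4600 info bigd[3471]: reap_child: child process PID = 25338 exited with signal = 9"
LINE_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+\w+\s+bigd\[\d+\]:\s+"
    r"reap_child: child process PID = (?P<pid>\d+) exited with signal = (?P<sig>\d+)"
)


def count_signal_exits(lines, signal_number=9):
    """Count bigd reap_child exits for one signal, grouped by minute."""
    per_minute = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and int(match.group("sig")) == signal_number:
            per_minute[match.group("minute")] += 1
    return per_minute


if __name__ == "__main__":
    # Two lines copied from the question; in practice, pass an open log file instead.
    sample = [
        "Oct  7 08:35:16 local/pri-4600 info bigd[3471]: reap_child: child process PID = 2784 exited with signal = 9",
        "Oct  7 08:35:16 local/pri-4600 info bigd[3471]: reap_child: child process PID = 2807 exited with signal = 9",
    ]
    for minute, count in sorted(count_signal_exits(sample).items()):
        print(minute, count)
```

Because the function accepts any iterable of lines, it could also be fed an open file handle for /var/log/ltm. A steady count per minute would suggest a recurring monitor timeout rather than a one-off event.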

score 1 · Accepted Answer · answered Dec 31 '15 at 17:07

This was a known bug in the 10.2.4 BIG-IP software we were running.

From F5 support:

...you hit a known issue tracked internally as bug ID 539130, "bigd can deadlock while processing SIGCHLD causing bigd heartbeat failure and SIGABRT". Condition: External monitors that run for a long time and are killed by the next iteration of the monitor may cause bigd to crash and core, which causes a temporary lapse in health monitoring.
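
To illustrate where a "signal = 9" exit comes from, here is a rough sketch of a parent process that kills a child that has overrun its deadline and then reaps it. This is not bigd's implementation and it does not reproduce the deadlock. It only shows the kill-and-reap pattern on a POSIX system, with a hypothetical `run_with_deadline` helper and a sleeping Python child standing in for a slow external monitor.

```python
import signal
import subprocess
import sys


def run_with_deadline(command, deadline_seconds):
    """Run command; send SIGKILL if it outlives the deadline. Returns the exit status."""
    child = subprocess.Popen(command)
    try:
        return child.wait(timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        child.kill()         # on POSIX this sends SIGKILL (signal 9)
        return child.wait()  # reap the child; a negative status means "killed by that signal"


# Hypothetical stand-in for a long-running external monitor script.
slow_monitor = [sys.executable, "-c", "import time; time.sleep(30)"]
status = run_with_deadline(slow_monitor, deadline_seconds=0.5)

if status < 0:
    # Mirrors the wording of the bigd log message in the question.
    print(f"child exited with signal = {-status}")
```

If an external monitor script regularly takes longer than its interval allows, every run would end this way, which would match the steady stream of messages in the log.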

The fix was to update the software with Hotfix-BIGIP-10.2.4-HF12-866.11-ENG.
