FreePBX server (CentOS base) locks up with no errors or kernel panics

Question

I've been dealing with a baffling situation for a few days. I have multiple headless CentOS 6.4 servers with the following specs:

Core

CentOS - 6.4 (Final)
Kernel - 2.6.32-358.14.1.el6.x86_64
FreePBX - 4.211.64-9
MoBo - Asus P8H61
CPU - Intel Core i3 3.4 GHz
Mem - 8GB Kingston DDR3 800-1600
HDD - WD Black 7200 RPM
PRI - Digium Device TE130 800a (rev 02)
PRI - Sangoma B600 (1923:0025)
SELinux status - disabled (I know, I know)

Packages
libpri-1.4.12-6_centos6.x86_64
libpri-debug-info-1.4.12-6_centos6.x86_64
libpridevel--1.4.12-6_centos6.x86_64
dahdi-firmware-oct6114-128-1.05.01-119_centos5.noarch
dahdi-linux-2.7.0-18_centos6.x86_64
wanpipe-7.0.4-kernel.2.6.32.358.14.1.el6.dahdi.2.7.0.rel.49.x86_64
dahdi-linux-kmod-debuginfo-2.7.0-45_centos6.2.6.32_358.14.1.el6.x86_64.x86_64
dahdi-linux-debuginfo-2.7.0-18_centos6.x86_64
dahdi-firmware-oct6114-032-1.07.01-119_centos5.noarch
kmod-dahdi-linux-2.7.0-45_centos6.2.6.32_358.14.1.el6.x86_64.x86_64
dahdi-firmware-oct6114-256-1.05.01-119_centos5.noarch
dahdi-firmware-te820-1.76-119_centos5.noarch
dahdi-firmware-vpmoct032-1.12.0-119_centos5.noarch
dahdi-firmware-2.5.0.1-119_centos5.noarch
dahdi-linux-devel-2.7.0-18_centos6.x86_64
dahdi-firmware-xorcom-1.0-1.noarch
dahdi-tools-debuginfo-2.7.0-37_centos6.x86_64
dahdi-firmware-oct6126-128-01.07.04-119_centos5.noarch
dahdi-firmware-oct6114-064-1.05.01-119_centos5.noarch
dahdi-firmware-hx8-2.06-119_centos5.noarch
dahdi-firmware-tc400m-MR6.12-119_centos5.noarch
schmooze-dahdi-1.0.0-2.noarch
dahdi-tools-2.7.0-37_centos6.x86_64
dahdi-tools-doc-2.7.0-37_centos6.x86_64

When this setup works, it works great. Ten servers at different locations run this same setup, hardware- and software-wise. Three of the ten, however, keep locking up. By locking up, I mean they become completely unresponsive on the network, and no phone calls can be sent or received. It takes a hard shutdown/reboot for the server to become operational again.

/var/log/messages, dmesg and dmesg.old just stop recording when the system locks up; there are no errors, hardware errors, or panics in the logs. /var/log/boot shows a normal startup, with just a few warnings about prodigy (which is not used). /var/log/mcelog is always empty (zero lines). /var/log/freepbx.log shows only normal INFO lines.
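For anyone wanting to line up the lockup times, here is a minimal sketch of how the last line logged before each boot could be pulled out of /var/log/messages. The function name `last_before_boot` is made up for illustration, and it assumes a boot shows up as a line containing `kernel: Linux version`; if your syslog writes other lines first at startup, pass a different pattern as the first argument.

```shell
#!/bin/sh
# Hypothetical helper: print the log line that immediately precedes each
# boot marker. Reads the log on stdin.
# $1 (optional) is an awk regex that identifies the first line of a boot;
# the default is an assumption about how a boot appears in the log.
last_before_boot() {
    marker=${1:-'kernel: Linux version'}
    awk -v marker="$marker" '
        $0 ~ marker { if (prev != "") print prev }  # boot starts: show what came before
        { prev = $0 }                               # remember the current line
    '
}
```

On a live box this would be run as `last_before_boot < /var/log/messages`. The timestamps on the printed lines give a rough upper bound on when logging stopped before each hard reboot.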

There is no pattern in the time frame or workload of the servers that correlates with the lockups. Sometimes a server will be up for three hours, sometimes for three days. Sensors show the temperature is always within range, and no CPU threshold events are recorded.

I've installed kdump and set the kernel parameters to panic on soft lockup and hung task, in addition to the defaults. kdump.conf was changed to default reboot. When I manually trigger a kernel panic with SysRq+C, kdump is triggered and dumps a crash file (though for some reason it does not auto reboot after that).

sar shows CPU utilization never over 5% and memory utilization never over 10%. HDD rd_sec peaks at 5.86 and wr_sec peaks at 120. Max disk util has been about 7% on average.
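As a sanity check on the panic settings, a small sketch like the following could confirm which of the expected parameters are actually present on the running kernel's command line. The function name `check_cmdline` is hypothetical; the parameter names are the ones used on these servers (`crashkernel`, `softlockup_panic`, `hung_task_panic`).

```shell
#!/bin/sh
# Hypothetical helper: report whether each wanted parameter appears on a
# kernel command line, either bare (e.g. "quiet") or as name=value.
# $1 is the command line string; the remaining arguments are parameter names.
# The body runs in a subshell so that "set -f" (no globbing) does not leak out.
check_cmdline() (
    set -f
    cmdline=$1
    shift
    for want in "$@"; do
        found=no
        for tok in $cmdline; do
            case $tok in
                "$want"|"$want"=*) found=yes ;;
            esac
        done
        echo "$want: $found"
    done
)
```

On a live system it would be called as `check_cmdline "$(cat /proc/cmdline)" crashkernel softlockup_panic hung_task_panic`. This only shows that the parameters were passed at boot, not that a given hang is of a kind that would trip them.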

I've run memtester and stress on the system, trying to make it crash, to no avail (the system needs to remain up if at all possible). Memtester runs ranging from 512M with 50 iterations up to 2048M with 100 iterations have all reported "ok" with no problems.

I cannot see any reason for these boxes locking up, or why kdump isn't being triggered (if it is a kernel panic). I've exhausted my log searching skills in attempts to find a reason for this behavior.

Does anyone else have an idea of where I could look, or what I could do to pinpoint the problem here please?

@MichaelHampton They were installed prior to my taking over this network. Up until the past week they've been working, and the consensus has been 'don't fix what's not broken'. 7 out of 10 servers work with this config. - PenguinCoder, Jul 14 '14 at 22:03
Crash dumps really are enabled. `cat /proc/cmdline` gives `ro root=UUID=37df6fc6-9cd6-4078-864a-94735e5f9e27 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=129M@0M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet acpi=off apm=off softlockup_panic=1 hung_task_panic=1 console=tty0 console=ttyS0,9600n8`, `service kdump status` reports `Kdump is operational`, and `cat /sys/kernel/kexec_crash_loaded` returns `1`. If/when I force a panic via SysRq+C, kdump does what is expected. - PenguinCoder, Jul 14 '14 at 22:27
