Bit of a baffling situation I've been dealing with for a few days. I have multiple headless CentOS 6.4 servers with the following specs:
Core
- CentOS - 6.4 (Final)
- Kernel - 2.6.32-358.14.1.el6.x86_64
- FreePBX - 4.211.64-9
- MoBo - Asus P8H61
- CPU - Intel Core i3 3.4 GHz
- Mem - 8GB Kingston DDR3 800-1600
- HDD - WD Black 7200 RPM
- PRI - Digium Device TE130 800a (rev 02)
- PRI - Sangoma B600 (1923:0025)
- SELinux status - disabled (I know, I know)
Packages
- libpri-1.4.12-6_centos6.x86_64
- libpri-debuginfo-1.4.12-6_centos6.x86_64
- libpri-devel-1.4.12-6_centos6.x86_64
- dahdi-firmware-oct6114-128-1.05.01-119_centos5.noarch
- dahdi-linux-2.7.0-18_centos6.x86_64
- wanpipe-7.0.4-kernel.2.6.32.358.14.1.el6.dahdi.2.7.0.rel.49.x86_64
- dahdi-linux-kmod-debuginfo-2.7.0-45_centos6.2.6.32_358.14.1.el6.x86_64.x86_64
- dahdi-linux-debuginfo-2.7.0-18_centos6.x86_64
- dahdi-firmware-oct6114-032-1.07.01-119_centos5.noarch
- kmod-dahdi-linux-2.7.0-45_centos6.2.6.32_358.14.1.el6.x86_64.x86_64
- dahdi-firmware-oct6114-256-1.05.01-119_centos5.noarch
- dahdi-firmware-te820-1.76-119_centos5.noarch
- dahdi-firmware-vpmoct032-1.12.0-119_centos5.noarch
- dahdi-firmware-2.5.0.1-119_centos5.noarch
- dahdi-linux-devel-2.7.0-18_centos6.x86_64
- dahdi-firmware-xorcom-1.0-1.noarch
- dahdi-tools-debuginfo-2.7.0-37_centos6.x86_64
- dahdi-firmware-oct6126-128-01.07.04-119_centos5.noarch
- dahdi-firmware-oct6114-064-1.05.01-119_centos5.noarch
- dahdi-firmware-hx8-2.06-119_centos5.noarch
- dahdi-firmware-tc400m-MR6.12-119_centos5.noarch
- schmooze-dahdi-1.0.0-2.noarch
- dahdi-tools-2.7.0-37_centos6.x86_64
- dahdi-tools-doc-2.7.0-37_centos6.x86_64
When this setup works, it works great. Ten servers at different locations run this same setup, identical in both hardware and software. Three of the ten, however, keep locking up. By locking up, I mean the server becomes completely unresponsive on the network and no phone calls can be sent or received. It takes a hard shutdown/reboot for the server to become operational again.
/var/log/messages, dmesg and dmesg.old simply stop recording when the system locks up; there are no errors, hardware errors, panics, etc. in the logs. /var/log/boot shows a normal startup, with just a few warnings about prodigy (which is not used). /var/log/mcelog is always empty (zero lines, no text). /var/log/freepbx.log shows normal INFO lines.
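Since the logs just stop when the box hangs, one way to pin down each lockup window is to look for large gaps between consecutive timestamps in /var/log/messages. The following is only a minimal sketch: it assumes the classic syslog "Mon DD HH:MM:SS" prefix, an English/C locale for month names, and a caller-supplied year (that format carries none). The function names and the ten-minute threshold are my own choices for illustration.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the 15-character 'Mon DD HH:MM:SS' prefix of a syslog line.

    Returns None for lines that do not start with a timestamp
    (continuation lines, blank lines, and so on).
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(minutes=10)):
    """Return (last_seen, next_seen) pairs where logging paused longer
    than `threshold`. The first element approximates the time of the hang,
    the second the time of the following boot."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_syslog_time(line, year)
        if current is None:
            continue
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# Example use (path and year are assumptions about the target system):
#   with open("/var/log/messages") as fh:
#       for last_seen, next_seen in find_gaps(fh, 2013):
#           print("silent from", last_seen, "until", next_seen)
```

A busy PBX normally logs something every few minutes, so the threshold may need tuning on a quiet box. The last timestamp before each gap can then be compared against sar samples and FreePBX activity around the same moment.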
There is no pattern in the timing or workload of the servers that correlates with the lockups. Sometimes a server will be up for three hours, sometimes for three days. Sensors show the temperature is always within range, and no CPU threshold events are recorded. I've installed kdump and set the kernel params to panic on softlockup and hung task, in addition to the defaults. kdump.conf was changed to default reboot. When I manually trigger a kernel panic with SysRq-C, kdump is triggered and dumps a crash file (though for some reason it does not auto reboot after that). Per sar, CPU utilization is never over 5% and memory utilization is never over 10%. HDD rd_sec peaks at 5.86 and wr_sec peaks at 120. Max disk util has averaged about 7%.
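To double-check that the panic-on-lockup settings are actually live on a running box (and identical across all ten servers), the values can be read back from /proc/sys/kernel. This is a rough sketch, not a definitive check: softlockup_panic and hung_task_panic correspond to the two params mentioned above, while the other names in WATCHED are my assumption of related knobs on this kernel; any file that does not exist is simply reported as None.

```python
import os

# Sysctl file names under /proc/sys/kernel. Only the first two are taken
# from the setup described above; the rest are assumed to be relevant.
WATCHED = (
    "softlockup_panic",
    "hung_task_panic",
    "hung_task_timeout_secs",
    "panic",
    "panic_on_oops",
    "unknown_nmi_panic",
    "nmi_watchdog",
    "sysrq",
)


def read_kernel_settings(base="/proc/sys/kernel", names=WATCHED):
    """Return {name: value-as-string}, with None for unreadable entries."""
    settings = {}
    for name in names:
        try:
            with open(os.path.join(base, name)) as fh:
                settings[name] = fh.read().strip()
        except OSError:
            settings[name] = None
    return settings


def disabled_triggers(settings,
                      triggers=("softlockup_panic", "hung_task_panic")):
    """List the panic triggers that are missing or set to 0."""
    return [name for name in triggers if settings.get(name) in (None, "0")]


# Example use on a live system:
#   current = read_kernel_settings()
#   print(current)
#   print("not armed:", disabled_triggers(current))
```

Running this on a healthy server and on one of the three that lock up makes it easy to spot a setting that was applied with sysctl but never persisted across the last reboot.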
I've run memtester and stress on the system, TRYING to make it crash, to no avail (the system needs to remain up if at all possible). Memtester runs ranging from 512M with 50 iterations up to 2048M with 100 iterations have all reported "ok", with no problems.
I cannot see any reason for these boxes locking up, or why kdump isn't being triggered (if it is a kernel panic). I've exhausted my log searching skills in attempts to find a reason for this behavior.
Does anyone else have an idea of where I could look, or what I could do to pinpoint the problem here please?