
During the past month, one of my Debian Squeeze (Linux 2.6.32-bpo.5-amd64) machines locked up hard, twice: no response to ARP, dark console, Caps Lock and Num Lock not working, Magic SysRq ineffective. Changing the kernel to 3.2.0-0.bpo.2-amd64 from backports didn't help either.

Temperature and load monitoring don't show any spikes before the crash.

How should I diagnose and debug such a problem?

Is netconsole my only bet?

EDIT: I've already disabled screen blanking:

#/etc/console-tools/config
BLANK_TIME=0
POWERDOWN_TIME=0

and

setterm -blank 0

on the physical console.

UPDATE:

This time when it locked up, the screen was still showing the login prompt. Since the last lockups I've run a 6h load test with BOINC (Prime 95) without any problems.

Hubert Kario
  • ECC uncorrectable errors in the system logs, by any chance? – womble Jul 25 '12 at 09:41
  • Yes, either netconsole or a serial console could help. Testing the physical memory with memtest86 could also be helpful. Lastly, if the server is connected to a managed switch, could you find out whether there are errors on the switch port the server is connected to? I had this kind of crash recently too, and I suspect a bug in a network driver. – AndreasM Jul 25 '12 at 09:44
  • Memory is rather unlikely: the hardware was re-purposed (simple disk swap, the case wasn't even opened) and was rock stable as a XenServer machine. It looks like ECC is disabled (?!), at least that's what the EDAC module says. I'll look into it. – Hubert Kario Jul 25 '12 at 10:29
  • Do you have a Broadcom ethernet NIC in the server? If so, what was your MTU set to? – Mike Pennington Jul 25 '12 at 11:14
  • @MikePennington: No, no Broadcom, I know about them... – Hubert Kario Jul 25 '12 at 11:50
  • @HubertKario, good for you... [I had to learn the hard way :-)](http://unix.stackexchange.com/a/38100/6766) – Mike Pennington Jul 25 '12 at 11:54
  • @womble: I can't enable ECC: I'm using DDR3 memory with the squeeze kernel (2.6.32-bpo.5-amd64), and support for DDR3 ECC was merged in 2.6.33. I'll test with backports. – Hubert Kario Jul 26 '12 at 17:42
  • I have a Lenovo desktop running Debian 7 with VMware Workstation (my NAS box) exhibiting the same issue. If I run 2 VMs under any sort of load, the server locks up more frequently. I could try running different memory configs, but it would be great to know how to go about diagnosing the kernel issue. – kkron Feb 02 '14 at 03:22
  • @kkron: As I've said in the answer below (http://serverfault.com/a/444748/55663), it was caused by a hardware problem, most probably the CPU or north bridge. – Hubert Kario Feb 02 '14 at 16:29

2 Answers


I've found two possible solutions; I'll report if they worked. EDIT: They didn't.

The first is the NMI watchdog, enabled by adding nmi_watchdog=1 to the kernel boot parameters.

The second one (thanks @womble for the suggestion) was forcing ECC on with:
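To confirm after a reboot that the parameter really reached the kernel, something like the following sketch may help. The helper name `has_boot_param` is made up for illustration, and the script only inspects the command line the kernel was booted with; it does not prove the watchdog is armed. On a stock squeeze install the parameter would typically be added to GRUB_CMDLINE_LINUX in /etc/default/grub followed by update-grub, but that is an assumption about the bootloader setup.

```shell
#!/bin/sh
# Hypothetical helper: report whether an exact parameter appears on a
# kernel command line. The file defaults to /proc/cmdline.
has_boot_param() {
    param=$1
    file=${2:-/proc/cmdline}
    [ -r "$file" ] || return 2
    # Put each word on its own line, then look for an exact whole-line match.
    tr -s ' \t' '\n' < "$file" | grep -qxF -- "$param"
}

if has_boot_param nmi_watchdog=1; then
    echo "nmi_watchdog=1 is on the kernel command line"
else
    echo "nmi_watchdog=1 is not on the kernel command line"
fi
```

If the watchdog is active, the NMI row in /proc/interrupts should also keep increasing between two reads a few seconds apart.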

modprobe amd64_edac_mod ecc_enable_override=1 edac_op_state=1

Unfortunately, support for ECC DDR3 memory is absent in the 2.6.32-bpo.5-amd64 (Debian squeeze) kernel, so I had to use 3.2 from backports.

I also made those module options persistent, so they are applied whenever the module is loaded:

echo options amd64_edac_mod ecc_enable_override=1 edac_op_state=1 > /etc/modprobe.d/amd64_edac_mod.conf
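Once the EDAC module is loaded with ECC reporting enabled, the error counters it exposes in sysfs can be polled to see whether memory errors precede a hang. The following is a minimal sketch, assuming the usual /sys/devices/system/edac/mc layout with per-controller ce_count and ue_count files; the function name `edac_report` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print corrected (ce) and uncorrected (ue) error
# counts for every EDAC memory controller found under the given directory.
edac_report() {
    base=${1:-/sys/devices/system/edac/mc}
    found=0
    for mc in "$base"/mc*; do
        # Skip entries that do not expose both counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        found=1
        echo "$(basename "$mc"): ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
    done
    if [ "$found" -eq 0 ]; then
        echo "no EDAC memory controllers found under $base"
        return 1
    fi
}

# Read the live counters; a failure here only means EDAC is not loaded.
edac_report || true
```

Running this from cron and logging the output to a remote host would leave a record that survives a hard lockup, which the local logs may not.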
Hubert Kario

As the hangs were happening more and more often, the problem was probably caused by a faulty mainboard or, less likely, the CPU. After replacing those components, the problems went away.

Hubert Kario