How to test a Linux server for hardware errors?

Question

I have a Debian 10 server that is randomly rebooting, though no errors were written to journald. The server has rebooted 20 times in the last 3 days.

$ journalctl --list-boots
-22 bdb1799f0c9a4e81af6d41b0bd6c5cd9 Tue 2023-01-17 12:42:00 UTC—Sat 2023-01-21 22:01:24 UTC
...
 -2 e306cc0481784a0cad5e7138b0fcfcdb Mon 2023-01-23 13:18:52 UTC—Mon 2023-01-23 13:28:54 UTC
 -1 e4ca2701610640cfb11c39c38d05c091 Mon 2023-01-23 13:32:02 UTC—Mon 2023-01-23 13:34:27 UTC
  0 d5c51684dc6e4538a241216f400d9ca7 Tue 2023-01-24 10:23:51 UTC—Tue 2023-01-24 13:10:04 UTC
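
To put a number on the reboot frequency, the boot list can be grouped by day. This is only a minimal sketch: `boots_per_day` is a hypothetical helper name, and it assumes the `--list-boots` layout shown above, where the start date is the fourth whitespace-separated field.

```shell
# Count boots per calendar day from `journalctl --list-boots` output on stdin.
# Lines whose fourth field is not a date (headers, "...") are ignored.
boots_per_day() {
    awk '$4 ~ /^[0-9]+-[0-9]+-[0-9]+$/ { n[$4]++ }
         END { for (d in n) print d, n[d] }' | sort
}

# Usage on the affected host:
#   journalctl --list-boots | boots_per_day
#   journalctl -b -1 -n 50    # last 50 journal lines of the previous boot
```

If the tail of the previous boot shows no shutdown sequence at all, the reboot was most likely a hard reset rather than a clean restart.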

Usually I run memtester, which takes a couple of hours (depending on RAM size) and is quite unlikely to actually reproduce the issue (if it really is memory).

$ apt install memtester
$ memtester 245GB 4 > memtester.log 2>&1

My server has 256GB RAM, in 16 RAM modules:

$ dmidecode -t memory | grep Size | wc -l
16
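
Note that `grep Size | wc -l` also counts empty slots (and, on newer dmidecode versions, other "... Size" fields). A slightly stricter count might look like the sketch below. `populated_dimms` is a hypothetical helper, and it assumes sizes are reported in MB or GB only.

```shell
# Summarise populated DIMM slots from `dmidecode -t memory` output on stdin.
# Only lines whose first field is exactly "Size:" and whose value is numeric
# are counted, so "No Module Installed" slots are skipped.
populated_dimms() {
    awk '$1 == "Size:" && $2 ~ /^[0-9]+$/ {
             mb = ($3 == "GB") ? $2 * 1024 : $2   # normalise to MB
             count++; total += mb
         }
         END { printf "%d modules, %d MB\n", count, total }'
}

# Usage (as root):
#   dmidecode -t memory | populated_dimms
```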

$ free -h
             total       used       free     shared    buffers     cached
Mem:          251G        32G       218G       113M         0B       135M
-/+ buffers/cache:        32G       219G
Swap:           0B         0B         0B

DDR3 modules:

Handle 0x002D, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR3
        Type Detail: Registered (Buffered)
        Speed: 1600 MHz
        Manufacturer: Hynix Semiconducto
        Serial Number: 093C2E1C          
        Asset Tag: Dimm0_AssetTag
        Part Number: HMT42GR7AFR4C-RD
        Rank: 2
        Configured Clock Speed: 1600 MHz

UPDATE: The system should have ECC memory modules (ECC seems to be detected in the output of dmidecode -t memory):

Handle 0x002B, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 512 GB
        Error Information Handle: Not Provided
        Number Of Devices: 8

After replacing all memory modules, the system shows EDAC MC0 errors (I hadn't seen those before):

Jan 24 14:47:07 kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Jan 24 15:00:13 kernel: perf: interrupt took too long (3174 > 3158), lowering kernel.perf_event_max_sample_rate to 63000
Jan 24 15:19:20 kernel: perf: interrupt took too long (3984 > 3967), lowering kernel.perf_event_max_sample_rate to 50000
Jan 24 16:01:03 kernel: perf: interrupt took too long (4983 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
Jan 24 17:43:25 kernel: perf: interrupt took too long (6233 > 6228), lowering kernel.perf_event_max_sample_rate to 32000
Jan 24 19:02:54 kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 19:02:54 kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 24 19:02:54 kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c00004f000800c1
Jan 24 19:02:54 kernel: EDAC sbridge MC0: TSC 2fe1a1819026 
Jan 24 19:02:54 kernel: EDAC sbridge MC0: ADDR 1ff0136000 
Jan 24 19:02:54 kernel: EDAC sbridge MC0: MISC 908400400041e8c 
Jan 24 19:02:54 kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1674586974 SOCKET 0 APIC 0
Jan 24 19:02:54 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff0136 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
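
The raw status word in the log (`8c00004f000800c1`) can be decoded by hand. The sketch below assumes the architectural IA32_MCi_STATUS bit layout as I understand it from the Intel SDM (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, corrected error count in bits 52:38, MCA error code in bits 15:0). The function names are hypothetical, and the count field is only meaningful on CPUs that report it.

```shell
# Print bit number $2 (0 = least significant) of integer $1.
bit() { echo $(( ($1 >> $2) & 1 )); }

# Decode an IA32_MCi_STATUS value given as exactly 16 hex digits (no 0x).
# The value is split into two 32-bit halves to avoid signed 64-bit overflow.
decode_mci_status() {
    s=$1
    hi=$((0x${s%????????}))      # status bits 63..32
    lo=$((0x${s#????????}))      # status bits 31..0
    code=$((lo & 0xffff))        # architectural MCA error code
    op=n/a
    # 000F 0000 1MMM CCCC marks a memory controller error; MMM is the operation.
    if [ $((code & 0xef80)) -eq 128 ]; then
        case $(( (code >> 4) & 7 )) in
            0) op=generic ;; 1) op=read ;; 2) op=write ;;
            3) op=addr_cmd ;; 4) op=scrub ;; *) op=reserved ;;
        esac
    fi
    printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d ' \
        "$(bit "$hi" 31)" "$(bit "$hi" 30)" "$(bit "$hi" 29)" "$(bit "$hi" 28)" \
        "$(bit "$hi" 27)" "$(bit "$hi" 26)" "$(bit "$hi" 25)"
    printf 'corrected_count=%d model_code=%04x mca_code=%04x mem_op=%s\n' \
        $(( (hi >> 6) & 0x7fff )) $(( (lo >> 16) & 0xffff )) "$code" "$op"
}

# Usage:
#   decode_mci_status 8c00004f000800c1
```

For the value above this reports UC=0 and PCC=0 with a corrected count of 1 and a scrub operation, which agrees with the "1 CE memory scrubbing error" line and `err_code:0008:00c1` printed by EDAC: a corrected error, not an uncorrected one.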

UPDATE 2: I've tried disabling the EDAC kernel module, as suggested by Red Hat/SUSE, to rule out the possibility that the module conflicts with the hardware error correction on the motherboard:

echo "blacklist sb_edac" >> /etc/modprobe.d/50-blacklist.conf
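
Since `>>` appends a duplicate line on every run, a guarded variant may be safer when this is scripted. This is a sketch only: `blacklist_module` is a hypothetical helper, and the default path is simply the file used above.

```shell
# Append "blacklist <module>" to a modprobe config file unless that exact
# line is already present. $2 overrides the config file path.
blacklist_module() {
    module=$1
    conf=${2:-/etc/modprobe.d/50-blacklist.conf}
    line="blacklist $module"
    if grep -qxF "$line" "$conf" 2>/dev/null; then
        echo "already blacklisted: $module"
    else
        echo "$line" >> "$conf"
        echo "blacklisted: $module"
    fi
}

# Usage (as root):
#   blacklist_module sb_edac
#   lsmod | grep -w sb_edac    # check whether the module is still loaded
```

The blacklist entry only stops the module from being auto-loaded; a module that is already loaded stays loaded until a reboot or `modprobe -r sb_edac`.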

This seems to prevent the reboots, but memory allocation is failing (under the workload). All memtests are still passing.

Hardware name: Supermicro X9DRFR/X9DRFR, BIOS 3.2 01/16/2015
Call Trace:
 dump_stack+0x66/0x81
 dump_header+0x6b/0x283
 ? ___ratelimit+0xa1/0x100
 oom_kill_process.cold.30+0xb/0x1cf
 out_of_memory+0x1a5/0x450
 mem_cgroup_out_of_memory+0xbe/0xd0
 try_charge+0x707/0x780
 mem_cgroup_try_charge+0x86/0x190
 __add_to_page_cache_locked+0x64/0x240
 add_to_page_cache_lru+0x4a/0xe0
 filemap_fault+0x34c/0x780
 ? filemap_map_pages+0x1ed/0x3a0
 ext4_filemap_fault+0x2c/0x40 [ext4]
 __do_fault+0x36/0x170
 __handle_mm_fault+0xdb6/0x11b0
 handle_mm_fault+0xd6/0x200
 __do_page_fault+0x249/0x4f0
 ? page_fault+0x8/0x30
 page_fault+0x1e/0x30
RIP: 0033:0x7f1e1d58ff9d
Code: Bad RIP value.
RSP: 002b:00007fff6a4fd3d8 EFLAGS: 00010202
RAX: 00007f1e183501e0 RBX: 00007f10cbf0a638 RCX: 0000000000000040
RDX: 0000000000000006 RSI: 00007f1e183501e6 RDI: 00007f10cbf0a626
RBP: 00007f10cbf0b3e8 R08: 0000000000000006 R09: 0000000000000007
R10: c2bdb975b17afafd R11: 00007f1e1d5b6060 R12: 00007f1e183501b0
R13: 0000000000000005 R14: 00007f10cbf093c0 R15: 00007f10cbf0b3c8
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 101eeb22ce3e ADDR 1ff19b6000 MISC 908400400041e8c 
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674617922 SOCKET 0 APIC 0 microcode 428
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 19a7daf91fd4 ADDR 1ff19b6000 MISC 908400400041e8c 
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674621954 SOCKET 0 APIC 0 microcode 428

Can you be more specific about the rebooting? Crashing and restarting? Powering off and on? Could it be a power supply issue (a UPS fault, perhaps)? – SEWTGIYWTKHNTDS, Jan 24 '23 at 14:19
I'm trying to rule out all possibilities. Technicians have checked the power supply, and it looks OK. The only suspicious messages are `kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000` – Tombart, Jan 24 '23 at 15:57
Is it old? I had a system that rebooted because the thermal paste on the CPU cooler had dried out and the CPU was overheating. Another server didn't like the UPS self-test; a firmware update sorted that one, but your reboot frequency seems too high for that. I see "interrupt took too long" on lots of systems, so it's probably not significant. Malicious user? Hope you sort it soon. – SEWTGIYWTKHNTDS, Jan 24 '23 at 16:30
I installed the system 2 weeks ago, and cooling seems to be working fine. The motherboard is a Supermicro `X9DRFR`. – Tombart, Jan 24 '23 at 20:20
@Davidw I'm unable to get into the BIOS, but the technicians tried updating the BIOS and checking the configuration. The server has been passing all hardware tests, running for days. – Tombart, Feb 13 '23 at 13:53
Supermicro servers have an IPMI BMC with its own network connection (sometimes a dedicated port, sometimes shared with NIC 1), and it has its own hardware error log. What's in that log? You can also read it from the OS using the `ipmitool` or `ipmiutil` package (Debian has both); try the `sel` command. Better to use `ipmiutil` (I've seen cases where it decoded messages much better). – Nikita Kipriyanov, Feb 13 '23 at 15:34
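
For reference, reading the BMC event log from the running OS could look roughly like this. It is a sketch that assumes the `ipmitool` package is installed and the kernel IPMI driver is loaded; `sel_memory_events` is a hypothetical filter name, and the keywords it matches are a guess at how memory events are usually labelled.

```shell
# Keep only SEL entries that mention memory or ECC (case-insensitive).
# Reads the event list on stdin so it works with any tool's text output.
sel_memory_events() {
    grep -iE 'memory|ecc'
}

# Usage (as root):
#   ipmitool sel elist | sel_memory_events
```

Entries in the SEL come from the BMC itself, so they can survive a hard reset that leaves nothing in journald.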

score 1 · Answer 1 · answered Jan 24 '23 at 13:36

Have you tried booting from https://www.memtest86.com/? It's always been great for me.

Chopper3

Not yet; I only have SSH access to a booted OS. Unfortunately, booting a custom image is not possible in this case. Is the `memtest86` algorithm very different from `memtester`? – Tombart Jan 24 '23 at 15:52
It boots from the tester ISO, so you've no OS in the way. – Chopper3 Jan 24 '23 at 19:10
Yes, I know. I can only install/compile packages in the provided rescue system. I don't have physical access to the server. AFAIK it's not possible to install `memtest86` as a package. – Tombart Jan 24 '23 at 19:54
If you have no control over hardware and suspect a hardware problem, this is not your problem. Hand it over to the person who is in charge of the hardware. – Nikita Kipriyanov Feb 13 '23 at 15:45
