Spontaneous reboot, machine check events, AMD Ryzen

Question

I've been running a brand new system on Arch Linux for about 3 weeks, and last night it spontaneously rebooted.

There's no shutdown/halt entry in journalctl at the time of the reboot, so I'm pretty sure this is hardware related, not a userspace program or ACPI.

journalctl
Jul 01 06:21:15 euclid sshd[25731]: ...
-- Reboot --
Jul 01 06:24:46 euclid systemd-journald[305]: Time spent on flushing to /var is 547us for 0 entries.

Then, during the boot,

Jul 01 06:24:46 euclid kernel: .... node  #0, CPUs:        #1  #2  #3
Jul 01 06:24:46 euclid kernel: mce: [Hardware Error]: Machine check events logged
Jul 01 06:24:46 euclid kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
Jul 01 06:24:46 euclid kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff810b4260 MISC d012000101000000 SYND 4d000000 IPID 500b000000000 
Jul 01 06:24:46 euclid kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1498915479 SOCKET 0 APIC 3 microcode 800111c
Jul 01 06:24:46 euclid kernel:   #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15

When I try to run mcelog, I get

0 % mcelog
mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
CPU is unsupported

I suspect either 1) I undersized the PSU for this system, or 2) overheating somewhere.

All of the PSU calculators I ran gave me a recommendation of 750W, so I went with an 850W PSU. Still, now I'm considering upgrading to a 1000W PSU.

My questions are: how do I interpret that machine check event? I guess it's specific to my CPU? Does AMD publish any information that would let me decode that error? And how would I know if the reboot was caused by overheating? I can't find any event log in the BIOS (ASUS).

EDIT : more details

Processor : Ryzen 7 1700

Mobo: Asus Prime x370-Pro

RAM: G.SKILL Trident Z (4x 8GB) 3200 (F4-3200C16D-16GTZKW)

PSU: EVGA SuperNOVA 850 P2 80+ PLATINUM 850W

GPU: GTX 1080-TI x2

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: 0604
Release Date: 04/06/2017

No overclocking. Stock BIOS settings.

It ran stably for several weeks. I did add three HDDs a couple of days before the event.

EDIT: The same crash appears to have happened again

Jul 06 22:46:37 euclid kernel: mce: [Hardware Error]: Machine check events logged
Jul 06 22:46:37 euclid kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
Jul 06 22:46:37 euclid kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff810b3ef6 MISC d012000101000000 SYND 4d000000 IPID 500b000000000 
Jul 06 22:46:37 euclid kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1499406389 SOCKET 0 APIC c microcode 800111c

Is your board BIOS up to date? Is the chip overclocked? I have an 1800X on an ASUS Prime X370 Pro and it rebooted over and over on Windows 10. There was a BIOS update from ASUS that fixed the problem. What cooler do you have? What MB? What RAM? There is a lot that could cause this, but the BIOS is the first place to start. – Gmck, Jul 01 '17 at 20:23

score 2 · Answer 1 · answered Jul 11 '17 at 18:53

This seems to be a CPU hardware problem. In the AMD community forums (https://community.amd.com/thread/215773) it was suggested to disable either SMT or OpCache as a workaround until this gets fixed.

I disabled OpCache in the BIOS and the mce: [Hardware Error] messages during boot-up disappeared. I have two identical systems, both of which had the same freezes/reboots. Since the change, neither system has frozen.
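
On the question of how to read the status word itself: the value after "Bank 5:" is the bank's machine check status register, and its top bits are flags. Below is a rough Python sketch that splits it up. The bit positions are the MCI_STATUS_* definitions from the Linux kernel header arch/x86/include/asm/mce.h as I understand them, the 6-bit extended error code width is an assumption for Family 17h, and decode_mce_status is just a helper name I made up. Treat it as an aid for reading the log, not as an official AMD decoder.

```python
# Flag bits of the machine check status register. Positions follow the
# MCI_STATUS_* macros in the Linux kernel (arch/x86/include/asm/mce.h);
# entries marked "AMD" are the AMD-specific ones from that header.
STATUS_FLAGS = {
    63: ("VAL", "register contents are valid"),
    62: ("OVER", "an earlier error was overwritten"),
    61: ("UC", "uncorrected error"),
    60: ("EN", "error reporting was enabled"),
    59: ("MISCV", "MISC register is valid"),
    58: ("ADDRV", "ADDR register is valid"),
    57: ("PCC", "processor context corrupt"),
    55: ("TCC", "task context corrupt (AMD)"),
    53: ("SYNDV", "SYND register is valid (AMD)"),
    44: ("DEFERRED", "deferred error (AMD)"),
    43: ("POISON", "poisoned data consumed (AMD)"),
}


def decode_mce_status(status):
    """Split a 64-bit status value into flag names and error code fields."""
    flags = [
        name
        for bit, (name, _desc) in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        # Assumption: 6-bit extended error code in bits 21:16.
        "extended_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # Status value copied from the journal lines in the question.
    decoded = decode_mce_status(0xBEA0000000000108)
    descriptions = {name: desc for name, desc in STATUS_FLAGS.values()}
    for name in decoded["flags"]:
        print(f"{name:8s} {descriptions[name]}")
    print(f"extended error code: {decoded['extended_error_code']:#x}")
    print(f"error code:          {decoded['error_code']:#06x}")
```

For bea0000000000108 this reports UC and PCC among the set flags, i.e. an uncorrected error that left the processor context unreliable, which fits a reset with nothing logged beforehand. It does not say which unit inside the core failed; for that you would need AMD's documentation for this processor family or a decoder such as the edac_mce_amd module that mcelog points to.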

mpreiner


Both machines are still up and running. Usually the freeze/reboot occurred after 1 day. – mpreiner Jul 13 '17 at 17:52
