I've been running a brand-new system on Arch Linux for about 3 weeks, and last night it spontaneously rebooted.
There's no shutdown/halt entry in journalctl at the time of the reboot, so I'm pretty sure this is hardware-related, not caused by a userspace program or ACPI.
journalctl
Jul 01 06:21:15 euclid sshd[25731]: ...
-- Reboot --
Jul 01 06:24:46 euclid systemd-journald[305]: Time spent on flushing to /var is 547us for 0 entries.
Then, during the boot,
Jul 01 06:24:46 euclid kernel: .... node #0, CPUs: #1 #2 #3
Jul 01 06:24:46 euclid kernel: mce: [Hardware Error]: Machine check events logged
Jul 01 06:24:46 euclid kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
Jul 01 06:24:46 euclid kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff810b4260 MISC d012000101000000 SYND 4d000000 IPID 500b000000000
Jul 01 06:24:46 euclid kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1498915479 SOCKET 0 APIC 3 microcode 800111c
Jul 01 06:24:46 euclid kernel: #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
When I try to run mcelog, I get
0 % mcelog
mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor. Please use the edac_mce_amd module instead.
CPU is unsupported
I suspect either 1) I undersized the PSU for this system, or 2) overheating somewhere.
All of the PSU calculators I ran gave me a recommendation of 750W, so I went with an 850W PSU. Still, now I'm considering upgrading to a 1000W PSU.
My questions are: how do I interpret that machine check event? I guess it's specific to my CPU. Does AMD publish any information that would let me decode that error? And how would I know if the reboot was due to overheating? I can't find any event log in the BIOS (ASUS).
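To make the first question concrete, here is a minimal sketch of how the status and IPID values from the log could be split into fields. It assumes the architectural x86 layout for bits 63-57 of the status register, and it assumes the AMD-specific bit positions (TCC, SYNDV, CECC, UECC, DEFERRED, POISON, the 6-bit extended error code, and the IPID HWID/McaType fields) follow the naming used in the Linux kernel's MCE headers. The function names are my own, and the sketch only extracts fields; it does not say what the error code means for this CPU.

```python
# Sketch only: split an MCA status value into flag bits and code fields.
# Bits 63-57 are the architectural x86 MCi_STATUS flags. The remaining
# entries are ASSUMED AMD-specific positions (names as in Linux's mce.h).
STATUS_FLAGS = {
    63: "VAL",       # register contains a valid error
    62: "OVER",      # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "EN",        # error reporting was enabled
    59: "MISCV",     # MISC register is valid
    58: "ADDRV",     # ADDR register is valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # assumed AMD: task context corrupt
    53: "SYNDV",     # assumed AMD: SYND register is valid
    46: "CECC",      # assumed AMD: correctable ECC error
    45: "UECC",      # assumed AMD: uncorrectable ECC error
    44: "DEFERRED",  # assumed AMD: deferred error
    43: "POISON",    # assumed AMD: poisoned data consumed
}


def decode_status(status: int) -> dict:
    """Return the set flag names plus the raw error-code fields."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,                  # bits 15:0
        "extended_error_code": (status >> 16) & 0x3F,   # bits 21:16 (assumed AMD)
    }


def decode_ipid(ipid: int) -> dict:
    """Split IPID into hardware ID and bank type (assumed AMD layout)."""
    return {
        "hwid": (ipid >> 32) & 0xFFF,       # bits 43:32
        "mca_type": (ipid >> 48) & 0xFFFF,  # bits 63:48
    }


if __name__ == "__main__":
    print(decode_status(0xBEA0000000000108))
    print(decode_ipid(0x500B000000000))
```

Under those assumptions, the status value has VAL, UC, EN, MISCV, ADDRV, PCC, TCC and SYNDV set, an error code of 0x108 and an extended error code of 0. The IPID splits into hardware ID 0xb0 and bank type 5. Mapping those numbers to a specific unit and error description still needs AMD's documentation for this processor family.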
EDIT: more details
Processor: Ryzen 7 1700
Mobo: ASUS Prime X370-Pro
RAM: G.SKILL Trident Z (4x 8GB) 3200 (F4-3200C16D-16GTZKW)
PSU: EVGA SuperNOVA 850 P2 80+ PLATINUM 850W
GPU: 2x GTX 1080 Ti
Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: 0604
Release Date: 04/06/2017
No overclocking. Stock BIOS settings.
It ran stably for several weeks. I did add 3 HDDs a couple of days before the event.
EDIT: The same crash appears to have happened again
Jul 06 22:46:37 euclid kernel: mce: [Hardware Error]: Machine check events logged
Jul 06 22:46:37 euclid kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
Jul 06 22:46:37 euclid kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff810b3ef6 MISC d012000101000000 SYND 4d000000 IPID 500b000000000
Jul 06 22:46:37 euclid kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1499406389 SOCKET 0 APIC c microcode 800111c
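As a small aid for reading the TIME field in both records, here is a sketch that converts it to a UTC timestamp. It assumes TIME is wall-clock time in Unix epoch seconds, and the helper name is my own.

```python
from datetime import datetime, timezone


def mce_time_to_utc(epoch_seconds: int) -> datetime:
    """Interpret the MCE TIME field as Unix epoch seconds (an assumption)."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    # TIME values copied from the two machine check records in the logs.
    for t in (1498915479, 1499406389):
        print(t, mce_time_to_utc(t).isoformat())
    # 1498915479 2017-07-01T13:24:39+00:00
    # 1499406389 2017-07-07T05:46:29+00:00
```

If the journal timestamps are in a UTC-7 local timezone (an assumption), both values fall a few seconds before the corresponding boot-time log lines. That would suggest TIME records when the leftover error was read during boot, not the moment of the crash itself.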