I have a Fedora 23 Linux box running as a dev server, and several times over the last few weeks I've gone to log into it and found that it had been reset. One time it rebooted right in front of me: it appeared to reset to the BIOS and then power up again.

This seems to happen about once every 2 or 3 days. The server log shows only normal operations (cron etc.) until it resets and reboots:

https://paste.fedoraproject.org/518600/33737531/

Jan 01 20:01:02 pc03.config run-parts[19540]: (/etc/cron.hourly) starting mcelog.cron
Jan 01 20:01:02 pc03.config run-parts[19544]: (/etc/cron.hourly) finished mcelog.cron
Jan 01 20:09:10 pc03.config puppet-agent[19565]: Applied catalog in 0.03 seconds
-- Reboot --
Jan 01 20:17:57 pc03.config systemd-journal[372]: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.6G available → current limit 1.5G).
Jan 01 20:17:57 pc03.config systemd-journal[372]: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.6G available → current limit 1.5G).
Jan 01 20:17:57 pc03.config kernel: Linux version 4.8.13-100.fc23.x86_64 (mockbuild@bkernel02.phx2.fedoraproject.org) (gcc version 5.3.1 20160406 (Red Hat 5.3.1-6) (GCC) ) #1 SMP Fri Dec 9 14:51:40 UTC 2016
Jan 01 20:17:57 pc03.config kernel: Command line: BOOT_IMAGE=/vmlinuz-4.8.13-100.fc23.x86_64 root=/dev/mapper/fedora_pc03-root ro rd.lvm.lv=fedora_pc03/root rd.lvm.lv=fedora_pc03/swap rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off LANG=en_GB.UTF-8
Jan 01 20:17:57 pc03.config kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Jan 01 20:17:57 pc03.config kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'

However, there are lots of these messages in the journal:

Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: It has been corrected by h/w and requires no further action
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: event severity: corrected
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:  Error 0, type: corrected
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:  fru_text: CorrectedErr
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   section_type: PCIe error
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   port_type: 0, PCIe end point
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   version: 0.0
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   command: 0xffff, status: 0xffff
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   device_id: 0000:80:02.3
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   slot: 0
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   secondary_bus: 0x00
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   vendor_id: 0xffff, device_id: 0xffff
Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   class_code: ffffff
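
To make records like the one above easier to compare across days, the journal lines can be grouped by their `{sequence}` number. The following is only a minimal sketch: `parse_hardware_errors`, `LINE_RE` and `SAMPLE` are hypothetical names, the parsing assumes the exact line layout shown in this journal, and lines that carry two pairs (such as `command: ..., status: ...`) keep the second pair inside the first key's value.

```python
import re

# Matches journal lines of the form "... {680}[Hardware Error]: <body>".
LINE_RE = re.compile(r"\{(?P<seq>\d+)\}\[Hardware Error\]:\s*(?P<body>.*)$")


def parse_hardware_errors(lines):
    """Group "key: value" fields by the {sequence} number of each record."""
    records = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a Hardware Error line
        key, sep, value = match.group("body").partition(": ")
        if not sep:
            continue  # free-text line without a "key: value" pair
        fields = records.setdefault(int(match.group("seq")), {})
        # Keep the first occurrence if a key repeats within one record.
        fields.setdefault(key.strip(), value.strip())
    return records


# A few lines copied from the journal excerpt above.
SAMPLE = [
    "Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]: event severity: corrected",
    "Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   section_type: PCIe error",
    "Jan 01 17:05:20 pc03.config kernel: {680}[Hardware Error]:   device_id: 0000:80:02.3",
    "Jan 01 20:09:10 pc03.config puppet-agent[19565]: Applied catalog in 0.03 seconds",
]

records = parse_hardware_errors(SAMPLE)
print(records[680]["event severity"], records[680]["device_id"])
```

Feeding the function the full kernel log for a day and counting records per `device_id` would show whether the corrected errors always name the same PCIe address.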

I checked the BIOS SMBIOS event log, and it only has the reboot code 0x17 showing the machine coming up after the reset; it has not registered any memory resets like I expected.

Unfortunately, the machine does not support IPMI, as the board is a Supermicro X9DAi.

I am not sure how to interpret the error code in that Hardware Error message, but it seems that 0000:80:02 corresponds to:

[root@pc03 ~]# lspci -s 0000:80:02
80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
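
A PCI address has the form domain:bus:device.function, so the `device_id` in the error record (`0000:80:02.3`) and the line printed by `lspci` (`80:02.0`) can be compared field by field. This is a small sketch rather than a diagnostic tool; `PciAddress` and `parse_pci_address` are hypothetical names, and it assumes the `lspci` line belongs to domain `0000`, as in the query.

```python
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


def parse_pci_address(text):
    """Split 'DDDD:BB:dd.f' (hexadecimal fields) into its components."""
    domain, bus, devfn = text.split(":")
    device, function = devfn.split(".")
    return PciAddress(int(domain, 16), int(bus, 16),
                      int(device, 16), int(function, 16))


reported = parse_pci_address("0000:80:02.3")  # from the Hardware Error record
listed = parse_pci_address("0000:80:02.0")    # from lspci, domain assumed 0000

same_slot = reported[:3] == listed[:3]        # same domain, bus and device
same_function = reported.function == listed.function
print(same_slot, same_function)
```

With these two inputs the domain, bus and device match, but the function does not: the error names function 3, while `lspci` lists only function 0 at that device.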

I am currently monitoring the server's temperatures and CPU, so I will have a good idea of the sensor states when it next crashes. Are there any other steps I can take to determine the root cause of this crashing?
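
For the temperature side of that monitoring, one option is to read the kernel's hwmon sysfs files directly, where each `temp*_input` file holds a value in millidegrees Celsius. The sketch below is an illustration under assumptions, not the monitoring actually in use here: `read_hwmon_temps` is a hypothetical helper, and which chips and sensors appear depends on the drivers loaded on the machine.

```python
import glob
import os


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM_input": degrees Celsius} for readable sensors."""
    readings = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        chip_dir = os.path.dirname(path)
        try:
            with open(os.path.join(chip_dir, "name")) as fh:
                chip = fh.read().strip()
        except OSError:
            chip = "unknown"  # no readable "name" file for this chip
        try:
            with open(path) as fh:
                millidegrees = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # sensor not readable right now; skip it
        # Prefix with the hwmon directory so two chips with the same name
        # (for example one per CPU socket) do not overwrite each other.
        key = f"{os.path.basename(chip_dir)}:{chip}/{os.path.basename(path)}"
        readings[key] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for sensor, celsius in read_hwmon_temps().items():
        print(f"{sensor}: {celsius:.1f} C")
```

Run from cron every minute with output appended to a file on disk, this would leave a last-known reading from shortly before each reset.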

Tom
  • Hm, the failing device isn't actually listed in your `lspci` output. I wouldn't bother going any further; just replace the motherboard. – Michael Hampton Jan 02 '17 at 21:34
  • Do you have an empty CPU socket? I see one other reference to this as a bug for a Cisco server, with the error being logged at bootup when one CPU socket is not populated. The solution is to ignore the error message as, in that case at least, it did not actually cause a problem. It seems it could be a red herring. I think checking sensor info at the time of the next crash would be a sensible choice. – Dylan Knoll Jan 03 '17 at 17:21
  • @DylanKnoll No empty socket. – Tom Jan 04 '17 at 06:27
  • @MichaelHampton Ah, it seems the motherboard is out of warranty. However, I re-seated all the components, and I've not seen the (corrected) error since the day before yesterday, nor any reboots, so maybe that has fixed the problem. I'm taking the opportunity to learn some more about Linux device troubleshooting, as it's not really come up for me before (things just worked...). The lspci output is interesting. – Tom Jan 04 '17 at 06:29
  • While you're at it, don't forget to update to a supported release. – Michael Hampton Jan 04 '17 at 06:57
