One of "my" DL165 G7 Proliants has rebooted out of the blue for the second time this month. The reboot was accompanied by these system event log entries in LightsOut:
Event Type Date Time Source Description Direction
OEM -- -- -- 00 00 00 00 01 02 00 00 00 00 00 00 00 --
Generic 07/19/2013 16:40:38 NMI Detect State Asserted Assertion
Generic 07/19/2013 16:40:42 Gen ID 0x41 Run-time Stop Assertion
OEM 07/19/2013 16:40:42 000137 01 80 00 00 00 01 --
OEM 07/19/2013 16:40:42 000137 02 54 44 4f 00 01 --
OEM 07/19/2013 16:40:42 000137 02 00 00 00 00 01 --
OEM 07/19/2013 16:40:42 000137 03 00 00 00 00 01 --
OEM 07/19/2013 16:40:42 000137 03 00 00 00 00 01 --
OEM 07/19/2013 16:40:42 000137 04 00 00 00 00 01 --
OEM 07/19/2013 16:40:42 000137 04 00 00 00 00 01 --
OEM 07/19/2013 16:40:42 000137 05 00 00 00 00 01 --
OEM 07/19/2013 16:40:42 000137 05 00 00 00 00 01 --
Generic 07/19/2013 16:43:54 Gen ID 0x41 C: boot completed Assertion
OEM 07/19/2013 16:43:54 000137 00 b4 6c e9 51 00 --
I have contacted HP support to get help decoding the events, but unfortunately without any notable success - I have been told that there is no accessible documentation available. What is it trying to tell me and how do I find out what is broken here?
Edit: the system is running Hyper-V 2012. The only useful event concerning the reset is Kernel-Power/41 with a BugcheckCode of 128 (0x00000080) and a BugcheckParameter1 of 0x4f4454, which match the first two OEM lines of the iLO event log (once you read the data bytes in little-endian order, at least). The bugcheck code led me to this MSDN article, which bluntly states that "the exact cause is difficult to determine".
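To make the byte-swap comparison concrete, here is a minimal Python sketch. The record layout it assumes (first byte is a sequence index, the next four bytes are a little-endian 32-bit value, the last byte is ignored) is my own guess from the two matching lines, not something HP has documented, and `decode_oem_record` is just a hypothetical helper name.

```python
from typing import Tuple


def decode_oem_record(hex_bytes: str) -> Tuple[int, int]:
    """Decode one 6-byte OEM data field from the iLO event log.

    ASSUMED layout (inferred from the log, not from HP documentation):
      byte 0     record index
      bytes 1-4  32-bit value, little-endian
      byte 5     unknown, ignored here
    """
    raw = bytes.fromhex(hex_bytes)  # whitespace between byte pairs is accepted
    if len(raw) != 6:
        raise ValueError("expected exactly 6 bytes, got %d" % len(raw))
    index = raw[0]
    value = int.from_bytes(raw[1:5], byteorder="little")
    return index, value


if __name__ == "__main__":
    # The first two OEM lines logged at 16:40:42
    for line in ("01 80 00 00 00 01", "02 54 44 4f 00 01"):
        index, value = decode_oem_record(line)
        print("record %d: 0x%08x" % (index, value))
```

Under that assumption, record 1 decodes to 0x00000080 (the BugcheckCode) and record 2 to 0x004f4454 (BugcheckParameter1), which is the match described above.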
In the HP support center, I found a seemingly similar problem description whose solution was to synchronize the clocks between cluster nodes. While my failing host does indeed run in a cluster, I have the clocks synchronized, and I cannot reproduce the issue by letting the clocks drift apart (apart from the obvious Kerberos authentication problems, nothing much happens if I desync the clocks).
The odd information I have been able to collect on the issue so far:
- A run-time stop entry in the IPMI event log indicates an OS blue screen (chapter 2.5.2 of the Winbond/Nuvoton WPCM450 BMC user guide)
- The IPMI documentation in the OpenIPMI project's man page states that you cannot send OEM events using the standard interface
- NMIs seem to have been common in the past to signal ECC parity errors and initiate resets of the PC, but the information seems antiquated, and in both cases I would expect appropriate event log entries telling me that errors or resets have occurred, which I do not have.
- According to the bmc-device man page and this post from the vger Linux kernel mailing list, a generator ID of 0x41 seems to mean that the NMI was triggered by the local management software or the kernel.
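
For the last point, this is how I understand the generator ID byte to be split up, as a small Python sketch. The bit layout (bit 0 selects "system software" vs. "IPMB slave address", bits 7:1 carry the ID) and the software ID ranges are written down from my reading of the IPMI specification and should be treated as assumptions to verify against the spec itself; `decode_generator_id` is a hypothetical helper, not part of any IPMI tool.

```python
# Software ID ranges as I recall them from the IPMI spec (ASSUMPTION, verify!)
SOFTWARE_ID_RANGES = [
    (0x00, 0x0F, "BIOS"),
    (0x10, 0x1F, "SMI handler"),
    (0x20, 0x2F, "system management software"),
    (0x30, 0x3F, "OEM"),
    (0x40, 0x46, "remote console software"),
    (0x47, 0x47, "terminal mode remote console software"),
]


def decode_generator_id(gen_id):
    """Split the low byte of an IPMI generator ID into (kind, id, description)."""
    ident = (gen_id >> 1) & 0x7F          # bits 7:1
    if gen_id & 0x01 == 0:                # bit 0 clear: an IPMB device
        return ("IPMB slave address", ident, "7-bit I2C address")
    for low, high, name in SOFTWARE_ID_RANGES:
        if low <= ident <= high:
            return ("system software", ident, name)
    return ("system software", ident, "reserved/unknown")


if __name__ == "__main__":
    kind, ident, name = decode_generator_id(0x41)
    print("0x41 -> %s, ID 0x%02x (%s)" % (kind, ident, name))
```

If those ranges are right, 0x41 has bit 0 set and a software ID of 0x20, i.e. the event was logged by software running on the host rather than by a hardware sensor, which would be consistent with the two sources above.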