
I am dealing with a situation on NetBSD where an NMI has dropped my box into DDB. I understand that an NMI could be due to some memory-related problem. I guess memory-mapped devices could also lead to the same scenario. Please correct me if I am wrong.

My understanding is that I need to read the status of all these devices, probably over PCI.

I do not know the what or the how of any of this.

On receiving an NMI, a trap is generated which puts NetBSD into the DDB debugger. It is difficult to learn anything from DDB at that point. My plan is to return from the trap without doing anything, so that the error causes a kernel core dump. Before returning from the trap, I also want to read the required registers/memory to dump the status of the devices involved. That is my plan of action. Let me know if there is a better, more correct way to do this.

My aim is to learn from the experts here and come up with a step-by-step plan to get to the source of the NMI.

ultimate cause

2 Answers


Intel describes platform-level error handling in a high-level document titled Platform-Level Error Handling Strategies for Intel Systems.

That document doesn't specifically cover the Centerton (64-bit Atom) that you mention, but it does give a good overview of how Intel thinks about hardware error reporting. Since the Centerton is a system-on-a-chip device, we can find much more about how it works in the device datasheets. In volume 1 of the Intel Atom Processor S1200 datasheet we find the following text:

Internal Non-Maskable Interrupts (NMIs) can be generated by PCI Express ports and internally from the internal IOCHK# signal from the Low Pin Count interface signal LPC_SERIRQ.

We also find that there are external power management error signal pins which can generate an NMI in Atom-based systems.

Undoubtedly, errors from the memory hardware could also be responsible for generating an NMI.
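As a rough illustration of telling these sources apart: on PC-compatible Intel chipsets the legacy NMI status and control register at I/O port 0x61 traditionally latches SERR# in bit 7 and IOCHK# in bit 6. I am assuming the S1200 keeps that layout, so check volume 2 of the datasheet before relying on it. The macro names and the `nmi_source()` helper below are made up for this sketch; it only decodes a byte that you have already read from the port.

```c
#include <stdint.h>

/*
 * Legacy PC/AT NMI status and control port.
 * ASSUMPTION: the S1200 implements the traditional layout of this port.
 */
#define EX_NMI_SC_PORT       0x61
#define EX_NMI_SC_SERR_STS   0x80    /* bit 7: SERR# was the NMI source  */
#define EX_NMI_SC_IOCHK_STS  0x40    /* bit 6: IOCHK# was the NMI source */

/*
 * Hypothetical helper: name the NMI source(s) latched in a byte that
 * was read from EX_NMI_SC_PORT. Reading the port itself is left out.
 */
static const char *
nmi_source(uint8_t sc)
{
    int serr = (sc & EX_NMI_SC_SERR_STS) != 0;
    int iochk = (sc & EX_NMI_SC_IOCHK_STS) != 0;

    if (serr && iochk)
        return "SERR# and IOCHK#";
    if (serr)
        return "SERR#";
    if (iochk)
        return "IOCHK#";
    return "neither";    /* e.g. a watchdog or another platform source */
}
```

If neither bit is set, the NMI presumably came from somewhere else (a watchdog, a power management pin, or a machine-check style error), and the chipset-specific registers would be the next place to look.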

Volume 2 of the S1220 datasheet gives more detail about the many system registers involved in handling error signals.

None of this says much about NetBSD, and I don't think you can expect too much from it. NetBSD doesn't have enough detailed knowledge of the many x86 systems it runs on to decode the specifics of hardware errors. It may be possible to access enough of the system registers through DDB, the NetBSD in-kernel debugger, though I suspect this would be very tedious to do manually.

One avenue you might explore is whether the system BIOS is able to read and interpret the error registers. However, unless your system also has a baseboard management controller (unlikely for Atom systems, if I understand correctly), it's unlikely that any record of system errors is kept somewhere the BIOS can access.
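To give an idea of what decoding such registers by hand involves, here is a minimal sketch for the standard PCI Status register, which is the upper 16 bits of the 32-bit dword at configuration offset 0x04. The error bit positions come from the PCI specification, but the `EX_` names and the helper are invented for this example. On NetBSD the dword would presumably be obtained with `pci_conf_read()` on the command/status register; I have not tried that from a trap handler.

```c
#include <stdint.h>

/* Error bits of the standard PCI Status register (illustrative names). */
#define EX_PCI_STATUS_MASTER_DATA_PERR  0x0100  /* bit 8  */
#define EX_PCI_STATUS_SIG_TARGET_ABORT  0x0800  /* bit 11 */
#define EX_PCI_STATUS_RCV_TARGET_ABORT  0x1000  /* bit 12 */
#define EX_PCI_STATUS_RCV_MASTER_ABORT  0x2000  /* bit 13 */
#define EX_PCI_STATUS_SIG_SYSTEM_ERROR  0x4000  /* bit 14: SERR# asserted */
#define EX_PCI_STATUS_DETECTED_PERR     0x8000  /* bit 15 */

#define EX_PCI_STATUS_ERROR_MASK \
    (EX_PCI_STATUS_MASTER_DATA_PERR | EX_PCI_STATUS_SIG_TARGET_ABORT | \
     EX_PCI_STATUS_RCV_TARGET_ABORT | EX_PCI_STATUS_RCV_MASTER_ABORT | \
     EX_PCI_STATUS_SIG_SYSTEM_ERROR | EX_PCI_STATUS_DETECTED_PERR)

/*
 * Hypothetical helper: given the command/status dword of one device,
 * return only the error bits of its Status register (0 if none set).
 */
static uint16_t
pci_status_errors(uint32_t cmdstat)
{
    uint16_t status = (uint16_t)(cmdstat >> 16);

    return status & EX_PCI_STATUS_ERROR_MASK;
}
```

Walking every bus/device/function and printing the non-zero results would show which device, if any, signalled SERR# or saw a parity error. These bits are sticky until software clears them, so a set bit does not prove the error is the one that raised this NMI.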

Greg A. Woods

An NMI (non-maskable interrupt) is generally raised by a hardware watchdog to indicate that the CPU is hung, not by invalid memory accesses (at least on MIPS and PowerPC, which I have some knowledge of). Invalid memory accesses have separate exceptions/interrupts to handle them.

One case where the CPU hangs is a deadlock or some similar condition. So taking a core dump and checking what each core was doing at the time of the NMI is one way forward.

Nithin
  • Thanks. I am already planning to do that. My aim here is to understand how to get to the source of the NMI. That information will probably be available only on a live system. For example, I/O port data may not be available in core dumps (I think). – ultimate cause Nov 01 '15 at 13:16
  • I actually "understood" the first part of the answer a little later. Even on MIPS or PowerPC, some unrecoverable hardware errors may generate an NMI. I never said anything about an invalid memory access. I meant some hardware issue related to main memory. Similar hardware issues in the I/O address space may also generate an NMI. – ultimate cause Nov 01 '15 at 13:44