
I am running fio jobs on my NVMe SSD and then hot-unplugging it while they run. The platform supports hotplug and the system runs CentOS 7.0. Several seconds after I pull the drive out, the system crashes and prints the following:

================

[ 1026.468414] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1

[ 1026.468422] pciehp 0000:5d:02.0:pcie04: Card present on Slot(6-1)

[ 1026.468432] pciehp 0000:5d:02.0:pcie04: slot(6-1): Link Down event

[ 1026.468451] pciehp 0000:5d:02.0:pcie04: Link Down event queued on slot(6-1): currently getting powered on

[ 1026.468457] pciehp 0000:5d:02.0:pcie04: Already enabled on slot(7-1)

[ 1026.468705] {1}[Hardware Error]: event severity: fatal

[ 1026.468744] {1}[Hardware Error]: Error 0, type: fatal

[ 1026.468782] {1}[Hardware Error]: section_type: PCIe error

[ 1026.468825] {1}[Hardware Error]: port_type: 0, PCIe end point

[ 1026.468867] {1}[Hardware Error]: version: 3.0

[ 1026.468915] {1}[Hardware Error]: command: 0x0102, status: 0x4010

[ 1026.468961] {1}[Hardware Error]: device_id: 0000:00:00.0

[ 1026.469901] {1}[Hardware Error]: slot: 0

[ 1026.469032] {1}[Hardware Error]: secondary_bus: 0x00

[ 1026.469070] {1}[Hardware Error]: vendor_id: 0x1ded, device_id: 0x3010

[ 1026.469117] {1}[Hardware Error]: class_code: 008001

[ 1026.469155] Kernel panic - not syncing: Fatal hardware error!

================
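
To make the `command: 0x0102, status: 0x4010` line easier to read, here is a minimal sketch that decodes those two values using the bit layout of the standard PCI Command and Status registers. The `decode` helper and the two tables are only illustrative names, and the tables list the commonly used bits rather than every defined one.

```python
# Bit positions of the standard PCI Command register (subset).
COMMAND_BITS = {
    0: "I/O Space Enable",
    1: "Memory Space Enable",
    2: "Bus Master Enable",
    6: "Parity Error Response",
    8: "SERR# Enable",
    10: "Interrupt Disable",
}

# Bit positions of the standard PCI Status register (subset).
STATUS_BITS = {
    3: "Interrupt Status",
    4: "Capabilities List",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode(value, names):
    """Return the names of all bits set in a 16-bit register value."""
    return [
        names.get(bit, f"bit {bit} (not listed here)")
        for bit in range(16)
        if value & (1 << bit)
    ]


# Values copied from the "command: 0x0102, status: 0x4010" log line.
command_flags = decode(0x0102, COMMAND_BITS)
status_flags = decode(0x4010, STATUS_BITS)

print("command:", command_flags)
print("status: ", status_flags)
```

With the values from the log, the command register has Memory Space Enable and SERR# Enable set, and the status register has Capabilities List and Signaled System Error set. That seems consistent with the endpoint signalling a system error, but it does not by itself explain the "Card present" and "Link Down" pair.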

My guess at the root cause of the crash is that the contradictory pair of events, "Card present" and "Link Down", has confused the hotplug logic. What puzzles me is that pciehp reports both "Card present" and "Link Down" at the same time. In my experience, "Card present" usually comes with "Link Up", and "Link Down" usually comes with "Card not present".

Could anybody give me some clues about how this situation can arise? Or which bits in the PCIe registers trigger the "Card present" and "Link Down" events?

  • Hi there! Stack Overflow is for programming questions, so you might get a better reception for server questions over on https://serverfault.com/ ... – Anon Nov 04 '18 at 08:57
  • Thanks for the reminder! I have posted the question on serverfault.com as well. Should I close this one here? – Leo Erzhuo Chen Nov 07 '18 at 03:21
  • Yes, delete it here (as it means you recognize the accident). Cross-posting on Stack Overflow/Stack Exchange sites is generally discouraged (https://meta.stackexchange.com/a/64069 ) but you're new here and still learning the ropes, so you didn't do it out of malintent :-) Good luck! – Anon Nov 07 '18 at 06:20
