Does anyone know what the error below (dmesg output) indicates? I get it periodically when writing to an Intel NVMe drive (attached via a PCIe adapter card) under Linux. I'm not sure whether "no further action" means I should just ignore it, or whether the adapter card is junk.

[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]: It has been corrected by h/w and requires no further action
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]: event severity: corrected
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:  Error 0, type: corrected
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:   section_type: PCIe error
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:   port_type: 0, PCIe end point
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:   version: 3.0
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:   command: 0x0506, status: 0x0010
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:   device_id: 0000:17:00.0
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:   slot: 0
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:   secondary_bus: 0x00
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:   vendor_id: 0x8086, device_id: 0xf1a6
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:   class_code: 020801
[Mon Oct  1 13:46:53 2018] nvme 0000:17:00.0: aer_status: 0x000010c0, aer_mask: 0x00002000
[Mon Oct  1 13:46:53 2018] Bad TLP, Bad DLLP, Replay Timer Timeout
[Mon Oct  1 13:46:53 2018] nvme 0000:17:00.0: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[Mon Oct  1 14:21:56 2018] perf: interrupt took too long (3147 > 3135), lowering kernel.perf_event_max_sample_rate to 63500
Server Fault

1 Answer

That message comes from a RAS (reliability, availability, serviceability) feature: the hardware detected an error and corrected it. No further action is needed for this specific fault. However, a high rate of corrected errors is sometimes an early indicator of failure.
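To see what was actually corrected, the aer_status value from the log can be decoded bit by bit. The following is a minimal sketch, assuming the standard bit layout of the PCIe AER Correctable Error Status register; the function name decode_correctable is just an illustrative choice, not a kernel or library API.

```python
# Bit positions of the PCIe AER Correctable Error Status register
# (assumed from the PCIe specification's standard layout).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(status):
    """Return the names of all correctable-error bits set in `status`."""
    return [
        name
        for bit, name in sorted(CORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    # aer_status and aer_mask as reported in the dmesg output of the question
    print(decode_correctable(0x000010C0))
    print(decode_correctable(0x00002000))
```

For the values in the question, the status decodes to Bad TLP, Bad DLLP, and Replay Timer Timeout, which matches the line the kernel printed. All three are Data Link Layer errors that are recovered by retransmission.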

A reasonable response lies somewhere between ignoring it and junking the disk. Have a spare ready, verify your backups, and check whether the drive has redundancy as part of an array.
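One way to judge whether the rate of corrected errors is climbing is to count them per hour from saved `dmesg -T` output. This is a rough sketch only: the function name corrected_per_hour is hypothetical, it assumes the human-readable timestamp format shown in the question, and what counts as a "high" rate is left for you to decide.

```python
import re
from collections import Counter
from datetime import datetime

# Matches a `dmesg -T` style timestamp followed by the corrected-severity line.
EVENT = re.compile(r"\[([^\]]+)\].*event severity: corrected")


def corrected_per_hour(log_text):
    """Count corrected hardware error events, bucketed by hour."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT.search(line)
        if not match:
            continue
        # Collapse the double space dmesg uses to pad single-digit days.
        stamp = " ".join(match.group(1).split())
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        counts[when.replace(minute=0, second=0)] += 1
    return counts


# Two lines taken from the dmesg output in the question.
SAMPLE = """\
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]: event severity: corrected
[Mon Oct  1 13:46:53 2018] {24}[Hardware Error]:  Error 0, type: corrected
"""

if __name__ == "__main__":
    for hour, count in sorted(corrected_per_hour(SAMPLE).items()):
        print(hour.isoformat(), count)
```

Only the "event severity" line is counted, so each reported event contributes once even though it spans many log lines.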

John Mahowald