I'm debugging a PCIe hardware issue on Linux and want to enable the PCIe AER driver so I can catch any AER errors reported by my hardware device. I'm following this kernel document:
https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt
My syslog shows AER is enabled:
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-108-generic root=UUID=a9f6d189-c13d-485c-a504-ba0aa0127e2e ro quiet splash aerdriver.forceload=y crashkernel=512M-:192M vt.handoff=1
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-108-generic root=UUID=a9f6d189-c13d-485c-a504-ba0aa0127e2e ro quiet splash aerdriver.forceload=y crashkernel=512M-:192M vt.handoff=1
[ 0.640130] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 0.661638] acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 0.678143] acpi PNP0A08:02: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 0.694863] acpi PNP0A08:03: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 4.747041] acpi PNP0A08:04: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 4.751760] acpi PNP0A08:05: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 4.758480] acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 4.763990] acpi PNP0A08:07: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[ 5.463432] pcieport 0000:00:01.1: AER enabled with IRQ 34
[ 5.463450] pcieport 0000:00:07.1: AER enabled with IRQ 35
[ 5.463472] pcieport 0000:00:08.1: AER enabled with IRQ 37
[ 5.463517] pcieport 0000:10:01.1: AER enabled with IRQ 38
[ 5.463547] pcieport 0000:10:07.1: AER enabled with IRQ 39
[ 5.463575] pcieport 0000:10:08.1: AER enabled with IRQ 41
[ 5.463604] pcieport 0000:20:03.1: AER enabled with IRQ 42
[ 5.463635] pcieport 0000:20:07.1: AER enabled with IRQ 44
[ 5.463663] pcieport 0000:20:08.1: AER enabled with IRQ 46
[ 5.463782] pcieport 0000:30:03.1: AER enabled with IRQ 47
[ 5.463811] pcieport 0000:30:07.1: AER enabled with IRQ 49
[ 5.463843] pcieport 0000:30:08.1: AER enabled with IRQ 51
[ 5.463872] pcieport 0000:40:07.1: AER enabled with IRQ 62
[ 5.463895] pcieport 0000:40:08.1: AER enabled with IRQ 64
[ 5.463930] pcieport 0000:50:07.1: AER enabled with IRQ 66
[ 5.463965] pcieport 0000:50:08.1: AER enabled with IRQ 68
[ 5.464000] pcieport 0000:60:07.1: AER enabled with IRQ 70
[ 5.464044] pcieport 0000:60:08.1: AER enabled with IRQ 72
[ 5.464071] pcieport 0000:70:07.1: AER enabled with IRQ 74
[ 5.464099] pcieport 0000:70:08.1: AER enabled with IRQ 76
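For reference, my SSD is at 37:00.0; to see which of these root ports sits above it, the upstream chain can be read straight out of sysfs (a quick sketch, with my BDF substituted):

# Print the upstream chain of the SSD (root port -> switch -> endpoint);
# 0000:37:00.0 is my SSD, adjust for other devices.
readlink /sys/bus/pci/devices/0000:37:00.0

# Or view the whole PCI hierarchy as a tree:
lspci -tv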
The hardware device is a Samsung SSD connected to the Root Complex through a PCIe switch.
PCIe topology: Root Complex - <PLDA PCIe switch + FPGA> - Samsung EVO SSD
Unfortunately, I'm seeing a lot of NVMe-related errors, but no AER errors are logged:
Jun 26 12:13:54 ndra-Diesel kernel: [ 1080.672606] nvme1n1: p1 p2
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542592] nvme nvme1: I/O 832 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542617] nvme nvme1: I/O 833 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542627] nvme nvme1: I/O 834 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542636] nvme nvme1: I/O 835 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542645] nvme nvme1: I/O 872 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542654] nvme nvme1: I/O 873 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542662] nvme nvme1: I/O 874 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542670] nvme nvme1: I/O 875 QID 5 timeout, aborting
Jun 26 12:14:58 ndra-Diesel kernel: [ 1144.262425] nvme nvme1: I/O 832 QID 5 timeout, reset controller
Jun 26 12:15:29 ndra-Diesel kernel: [ 1174.982243] nvme nvme1: I/O 16 QID 0 timeout, reset controller
Jun 26 12:15:40 ndra-Diesel gnome-software[6474]: no app for changed ubuntu-dock@ubuntu.com
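To check whether the SSD is latching errors that simply never get signaled, the AER status registers can be dumped directly with setpci (a sketch; the ECAP_AER register name needs a reasonably recent pciutils, and 37:00.0 is my SSD's BDF):

# Uncorrectable Error Status (offset 0x04 of the AER capability)
setpci -s 37:00.0 ECAP_AER+0x04.l
# Correctable Error Status (offset 0x10)
setpci -s 37:00.0 ECAP_AER+0x10.l
# Uncorrectable / Correctable Error Masks (offsets 0x08 / 0x14)
setpci -s 37:00.0 ECAP_AER+0x08.l
setpci -s 37:00.0 ECAP_AER+0x14.l

A non-zero status here with nothing in dmesg would point at the signaling path rather than at error detection.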
I have custom-compiled my kernel with the following options:
cat /boot/config-4.15.0-108-generic | grep -i PCIE
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_PCIEAER=y
CONFIG_PCIEAER_INJECT=y
CONFIG_PCIEPORTBUS=y
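Since CONFIG_PCIEAER_INJECT=y, the reporting path itself can be exercised with the aer-inject tool the howto points to (https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git). A minimal sketch; the file syntax below is taken from the examples in the aer-inject source tree (BUS is decimal, 55 == 0x37), so double-check it against the tool's README:

# Describe a fake correctable error on the SSD at 0000:37:00.0
cat > cor-bad-tlp <<'EOF'
AER
DOMAIN 0000
BUS 55
DEV 0
FN 0
COR_STATUS BAD_TLP
HEADER_LOG 0 1 2 3
EOF
# The injector is built in (CONFIG_PCIEAER_INJECT=y), so no modprobe needed:
aer-inject cor-bad-tlp

If the AER machinery is set up correctly, this should produce a correctable-error report in dmesg even with no real hardware fault.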
The lspci output for my Samsung NVMe shows that it has the AER capability:
37:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a801
Flags: bus master, fast devsel, latency 0, IRQ 54, NUMA node 3
Memory at b6500000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
Capabilities: [100] Advanced Error Reporting <----------------------------- SEE THIS
Capabilities: [148] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158] Power Budgeting <?>
Capabilities: [168] #19
Capabilities: [188] Latency Tolerance Reporting
Capabilities: [190] L1 PM Substates
Kernel driver in use: nvme
Kernel modules: nvme
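The capability list alone doesn't show whether error reporting is actually enabled; that lives in the Device Control register. It can be checked like this (the exact flag wording varies with the pciutils version):

# The CorrErr/NonFatalErr/FatalErr/UnsupReq flags should read '+'
# if the kernel enabled error reporting on the endpoint:
lspci -s 37:00.0 -vvv | grep -A2 'DevCtl:'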
But the lspci output for the PLDA switch doesn't show an AER capability:
33:00.0 PCI bridge: PLDA XpressSwitch (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 53, NUMA node 3
Bus: primary=33, secondary=34, subordinate=3b, sec-latency=0
Memory behind bridge: b6400000-b65fffff
Capabilities: [80] Express Upstream Port, MSI 00
Capabilities: [e0] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [f8] Power Management version 3
Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
Capabilities: [300] #19
Kernel driver in use: pcieport
Kernel modules: shpchp
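Since errors from devices below a switch are ultimately signaled through the Root Port above it, the Root Port's AER registers seem worth checking too. Assuming 30:03.1 is the root port above this switch (the bus numbers suggest so, but confirm with lspci -t), a sketch:

# Root Error Status (offset 0x30 of the AER capability) on the root port:
setpci -s 30:03.1 ECAP_AER+0x30.l
# Error Source Identification (offset 0x34): requester ID of the last reporter
setpci -s 30:03.1 ECAP_AER+0x34.l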
I have two questions:
- The Samsung NVMe sits behind the PLDA switch in the topology, and the switch doesn't expose an AER capability. Could that be the reason I'm not seeing AER errors from the NVMe?
- Do I need to do anything else to enable AER on Linux?