
I'm debugging a PCIe hardware issue on Linux and I want to enable the PCIe AER driver so it catches any AER errors reported by my hardware device. I'm following this kernel documentation:

https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt

My syslog shows that AER is enabled:

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-108-generic root=UUID=a9f6d189-c13d-485c-a504-ba0aa0127e2e ro quiet splash aerdriver.forceload=y crashkernel=512M-:192M vt.handoff=1
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-108-generic root=UUID=a9f6d189-c13d-485c-a504-ba0aa0127e2e ro quiet splash aerdriver.forceload=y crashkernel=512M-:192M vt.handoff=1
[    0.640130] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    0.661638] acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    0.678143] acpi PNP0A08:02: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    0.694863] acpi PNP0A08:03: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    4.747041] acpi PNP0A08:04: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    4.751760] acpi PNP0A08:05: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    4.758480] acpi PNP0A08:06: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    4.763990] acpi PNP0A08:07: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    5.463432] pcieport 0000:00:01.1: AER enabled with IRQ 34
[    5.463450] pcieport 0000:00:07.1: AER enabled with IRQ 35
[    5.463472] pcieport 0000:00:08.1: AER enabled with IRQ 37
[    5.463517] pcieport 0000:10:01.1: AER enabled with IRQ 38
[    5.463547] pcieport 0000:10:07.1: AER enabled with IRQ 39
[    5.463575] pcieport 0000:10:08.1: AER enabled with IRQ 41
[    5.463604] pcieport 0000:20:03.1: AER enabled with IRQ 42
[    5.463635] pcieport 0000:20:07.1: AER enabled with IRQ 44
[    5.463663] pcieport 0000:20:08.1: AER enabled with IRQ 46
[    5.463782] pcieport 0000:30:03.1: AER enabled with IRQ 47
[    5.463811] pcieport 0000:30:07.1: AER enabled with IRQ 49
[    5.463843] pcieport 0000:30:08.1: AER enabled with IRQ 51
[    5.463872] pcieport 0000:40:07.1: AER enabled with IRQ 62
[    5.463895] pcieport 0000:40:08.1: AER enabled with IRQ 64
[    5.463930] pcieport 0000:50:07.1: AER enabled with IRQ 66
[    5.463965] pcieport 0000:50:08.1: AER enabled with IRQ 68
[    5.464000] pcieport 0000:60:07.1: AER enabled with IRQ 70
[    5.464044] pcieport 0000:60:08.1: AER enabled with IRQ 72
[    5.464071] pcieport 0000:70:07.1: AER enabled with IRQ 74
[    5.464099] pcieport 0000:70:08.1: AER enabled with IRQ 76
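
These "AER enabled with IRQ" lines mean the AER port service driver bound to each root port. As an extra sanity check, the service driver and its interrupts can be inspected directly (a sketch; the sysfs layout and IRQ naming vary between kernel versions, and on 4.15-era kernels the AER interrupts are registered under the name "aerdrv"):

    # Ports claimed by the AER port service driver (path may differ by kernel)
    ls /sys/bus/pci_express/drivers/aer/

    # The AER IRQs should show up here as live interrupt lines
    grep -i aer /proc/interrupts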

The hardware device is a Samsung SSD connected to the Root Complex through a PCIe switch:

PCIe topology: Root Complex - <PLDA PCIe switch + FPGA> - Samsung EVO SSD

Unfortunately, I'm seeing a lot of NVMe-related errors, but no AER errors are being reported:

Jun 26 12:13:54 ndra-Diesel kernel: [ 1080.672606]  nvme1n1: p1 p2
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542592] nvme nvme1: I/O 832 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542617] nvme nvme1: I/O 833 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542627] nvme nvme1: I/O 834 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542636] nvme nvme1: I/O 835 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542645] nvme nvme1: I/O 872 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542654] nvme nvme1: I/O 873 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542662] nvme nvme1: I/O 874 QID 5 timeout, aborting
Jun 26 12:14:27 ndra-Diesel kernel: [ 1113.542670] nvme nvme1: I/O 875 QID 5 timeout, aborting
Jun 26 12:14:58 ndra-Diesel kernel: [ 1144.262425] nvme nvme1: I/O 832 QID 5 timeout, reset controller
Jun 26 12:15:29 ndra-Diesel kernel: [ 1174.982243] nvme nvme1: I/O 16 QID 0 timeout, reset controller
Jun 26 12:15:40 ndra-Diesel gnome-software[6474]: no app for changed ubuntu-dock@ubuntu.com
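
An NVMe timeout on its own doesn't imply that a PCIe-level error message was ever signalled; the command may simply never complete. It's worth checking whether the device and link are still up when the timeouts hit (37:00.0 is the SSD's address from the lspci output further down):

    # DevSta holds the per-device error flags, LnkSta the negotiated link state.
    # If config reads come back as all 0xff, the device has dropped off the bus.
    sudo lspci -s 37:00.0 -vv | grep -E 'DevSta|LnkSta'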

I have custom-compiled my kernel with the following options:

cat /boot/config-4.15.0-108-generic | grep -i PCIE
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_PCIEAER=y
CONFIG_PCIEAER_INJECT=y
CONFIG_PCIEPORTBUS=y
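
Since CONFIG_PCIEAER_INJECT=y is set, the reporting path can be exercised end-to-end with the userspace aer-inject tool that pcieaer-howto.txt points to, independent of the real hardware fault. A sketch (the injection-file syntax below follows the examples shipped with the tool, and the field values are guesses for this topology, so both may need adjusting):

    # Describe a fake correctable error on the SSD at 0000:37:00.0
    cat > aer.conf <<'EOF'
    AER
    DOMAIN 0x0000
    BUS 0x37
    DEV 0x0
    FN 0x0
    COR_STATUS BAD_TLP
    HEADER_LOG 0 1 2 3
    EOF
    sudo ./aer-inject aer.conf
    # If the AER stack works, dmesg should now show the injected error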

The lspci output for my Samsung NVMe shows that it has AER Capability:

37:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801
        Flags: bus master, fast devsel, latency 0, IRQ 54, NUMA node 3
        Memory at b6500000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable+ Count=33 Masked-
        Capabilities: [100] Advanced Error Reporting <----------------------------- SEE THIS
        Capabilities: [148] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [158] Power Budgeting <?>
        Capabilities: [168] #19
        Capabilities: [188] Latency Tolerance Reporting
        Capabilities: [190] L1 PM Substates
        Kernel driver in use: nvme
        Kernel modules: nvme 
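
One detail worth verifying here: an endpoint only sends error messages if the reporting enables in its Device Control register are set, and the AER capability can additionally mask individual errors. Both are visible with lspci -vv run as root (exact field names vary with the lspci version), and the enables can be flipped by hand with setpci as a test:

    # DevCtl should show CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+;
    # in UEMsk/CEMsk a '+' means that error is masked (not reported)
    sudo lspci -s 37:00.0 -vv | grep -E 'DevCtl|UEMsk|CEMsk|UESta|CESta'

    # If the enables are off, set them: Device Control sits at offset 8 of
    # the PCIe capability, bits 0-3 are the four reporting enables
    sudo setpci -s 37:00.0 CAP_EXP+0x8.w=0x000f:0x000f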

But lspci for the PLDA switch doesn't show an AER capability:

33:00.0 PCI bridge: PLDA XpressSwitch (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 53, NUMA node 3
        Bus: primary=33, secondary=34, subordinate=3b, sec-latency=0
        Memory behind bridge: b6400000-b65fffff
        Capabilities: [80] Express Upstream Port, MSI 00
        Capabilities: [e0] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [f8] Power Management version 3
        Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
        Capabilities: [300] #19
        Kernel driver in use: pcieport
        Kernel modules: shpchp

I have two questions:

  1. The Samsung NVMe is behind the PLDA switch in this topology, and the switch doesn't show an AER capability. Can this be the reason I'm not seeing AER errors from the NVMe?
  2. Do I need to do anything else to enable AER on Linux?
h1990
  • AFAIK, the switch doesn't need AER support to route error messages to the root complex. – haggai_e Jun 27 '20 at 13:16
  • Also, you can check the endpoint's detailed AER capability with `lspci -vv`. You can check whether the device actually reported any error, and whether they are masked or not. – haggai_e Jun 27 '20 at 13:18
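
Building on haggai_e's comments: error messages from the endpoint are routed through the switch up to the root port, so it's also worth confirming that reporting is enabled in the root port's AER Root Error Command register. Which root port leads to bus 37 can be read off lspci -t (0000:30:03.1 below is only a guess for this topology; substitute the port found in the tree):

    # Find the root port above the switch
    lspci -t

    # Check its AER root registers; RootCmd should show
    # CERptEn+ NFERptEn+ FERptEn+
    sudo lspci -s 30:03.1 -vv | grep -E 'RootCmd|RootSta'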

1 Answer


Try adding this parameter to your kernel command line:

    pcie_ports=native
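
pcie_ports=native tells the kernel to use its native PCIe port services (AER among them) even if the firmware didn't grant control via _OSC. On Ubuntu it would go into the GRUB configuration, roughly like this (a sketch of the usual procedure):

    # Add pcie_ports=native to GRUB_CMDLINE_LINUX_DEFAULT in
    # /etc/default/grub, then regenerate the config and reboot
    sudo update-grub
    sudo reboot

    # Verify after reboot
    grep -o 'pcie_ports=[a-z]*' /proc/cmdline
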
Blaine McMahon