
I'm seeing network problems when I trigger a PCI rescan on Linux with `echo 1 > /sys/bus/pci/rescan`. I observe data loss, occasional deadlocks in client/server applications, and processes turning into zombies.

This happens on a node with two InfiniBand controllers and a few other PCIe devices. I need to trigger a PCI rescan when one of these devices fails, in order to re-enumerate the PCIe tree and make the device appear again. The setup is:

  • distribution: CentOS 7.2 (same on 7.1)
  • kernel: 3.10.0
  • OFED: OFED-3.1-1.0.3 (same with 3.4)
  • firmware: 12.17.1010 (Mellanox MT27700 Family [ConnectX-4])
  • grub boot option: pci=realloc=on

Is it possible to rescan the PCI bus while there is network activity without causing these issues? If not, is there a more selective way to re-enumerate just part of the PCIe tree?
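For reference, the kernel's sysfs interface also exposes a `rescan` file per bus (`/sys/bus/pci/devices/<bridge>/pci_bus/<bus>/rescan`), which rescans only that bus and its child buses. The sketch below lists and writes those files for a single bridge. Whether this narrower rescan avoids the network problems described above is not established here. The function names and the `SYSFS_PCI` override are hypothetical and only make the sketch easy to try on a copy of the tree.

```shell
#!/bin/sh
# Sketch only: rescan the buses behind one PCI bridge instead of the whole tree.
# SYSFS_PCI is a hypothetical override; by default it points at the real sysfs tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Print the rescan file of every secondary bus behind the given bridge.
bus_rescan_files() {
    bridge="$1"
    for f in "$SYSFS_PCI/$bridge"/pci_bus/*/rescan; do
        # An unmatched glob stays literal, so keep only paths that exist.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Write 1 to each of those files (needs root on a real system).
rescan_behind_bridge() {
    bus_rescan_files "$1" | while IFS= read -r f; do
        echo 1 > "$f"
    done
}
```

Assuming the failed device sits behind a bridge at the hypothetical address `0000:00:02.0`, running `rescan_behind_bridge 0000:00:02.0` as root would leave the buses that host the InfiniBand controllers out of the rescan, provided they are not behind the same bridge.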

jyvet
  • Do you see any change in the PCI configuration of the NIC or the switches that connect to it before and after the rescan? If the rescan changes for instance the MMIO addresses under the driver's feet it can cause the driver to malfunction. – haggai_e Dec 18 '16 at 12:01
  • Thanks for the reply. When I run `echo 1 > /sys/bus/pci/rescan` twice in a row, I get errors in `dmesg`: *[ 326.769911] mgag200 0000:08:03.0: BAR 6: [??? 0x00000000 flags 0x2] has bogus alignment*. Running `find /sys/devices/pci* -name "*ib[0-9]*"` returns `/sys/devices/pci0000:00/0000:00:03.2/0000:06:00.0/net/ib0` and `/sys/devices/pci0000:80/0000:80:03.0/0000:83:00.0/net/ib1`. Could this be related to the OFED drivers? I can reproduce this behavior on a different platform (other motherboard/CPU and EDR instead of FDR). – jyvet Dec 19 '16 at 17:58
  • I think mgag200 is a GPU driver. Perhaps you can set the kernel console to enable debug prints on the PCI rescan process and show what it does that relates to the HCA or the PCI bridges on its path. – haggai_e Dec 19 '16 at 18:41
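One way to check haggai_e's first suggestion is to snapshot each device's resource ranges before and after the rescan and compare them. This is a minimal sketch, assuming the per-device `resource` file in sysfs (one `start end flags` line per resource). The function name and the `SYSFS_PCI` override are hypothetical.

```shell
#!/bin/sh
# Sketch only: dump the resource ranges of every PCI device for later comparison.
# SYSFS_PCI is a hypothetical override; by default it points at the real sysfs tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Print one line per resource entry: "<device> <index> <start> <end> <flags>".
snapshot_resources() {
    for dev in "$SYSFS_PCI"/*; do
        [ -r "$dev/resource" ] || continue
        awk -v d="${dev##*/}" '{ print d, NR - 1, $0 }' "$dev/resource"
    done
}
```

A possible use is `snapshot_resources > before.txt`, then the rescan, then `snapshot_resources > after.txt` and `diff before.txt after.txt`. Any changed line for `0000:06:00.0`, `0000:83:00.0`, or the bridges above them would indicate that the rescan moved addresses the driver is using.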
