
I'm writing a Linux kernel device driver for a custom PCIe device. A user-space application mmaps the device's memory and reads and writes it frequently. The PCIe device is powered by an external power supply that may be switched off at runtime.
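A minimal user-space sketch of how such a mapping is commonly set up, assuming the BAR is reached through the device's sysfs resource0 file. The path and the map_bar/unmap_bar helpers are illustrative, not the question's actual code; the application may instead map through the driver's own mmap handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of a BAR exposed as a file, for example
 * /sys/bus/pci/devices/0000:01:00.0/resource0 (assumed path).
 * Returns NULL on failure. */
static volatile uint32_t *map_bar(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Undo map_bar(). */
static void unmap_bar(volatile uint32_t *bar, size_t len)
{
    munmap((void *)bar, len);
}
```

Declaring the pointer volatile keeps the compiler from caching or merging register accesses, so each bar[i] is a real load or store to the device. On typical x86 systems such a load does not fault when the device stops responding; it completes with all ones, which matches the 0xFFFFFFFF values described in the question.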

Whenever the device is reset, every memory read in my user application returns 0xFFFFFFFF. I want the kernel driver to detect device resets as soon as possible, so I implemented an error_detected callback as described in https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html:

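static pci_ers_result_t mydevice_error_detected(struct pci_dev* dev, pci_channel_state_t state) {
   /* Trailing "\n" ends the log line; without it the message may not be
    * flushed to the log until the next printk. */
   printk(KERN_ALERT "mydevice PCI error detected\n");
   return PCI_ERS_RESULT_DISCONNECT;
}

static const struct pci_error_handlers mydevice_error_handlers = {
   .error_detected = mydevice_error_detected,
   .slot_reset = mydevice_slot_reset,
   /* The error-recovery .resume returns void, while the PM .resume in
    * mydevice_driver returns int, so the two need separate functions.
    * mydevice_err_resume is a hypothetical name. */
   .resume = mydevice_err_resume
};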

static struct pci_driver mydevice_driver = {
   .name = "mydevice",
   .id_table = mydevice_ids,
   .probe = mydevice_probe,
   .remove = mydevice_remove,
   .suspend = mydevice_suspend,
   .resume = mydevice_resume,
   .err_handler = &mydevice_error_handlers
};

However, mydevice_error_detected is never called when the device is reset, even though the user-space application keeps trying to read device memory and gets 0xFFFFFFFF every time.

Also, lspci still lists the device after a PCI rescan, even when it has been turned off:

01:00.0 Unassigned class [ff00]: MyVendorId Device 5a00 (rev ff)

The only difference is the "rev ff" at the end of the line while the device is turned off. When the device is powered, lspci shows:

01:00.0 Unassigned class [ff00]: MyVendorId Device 5a00

I'm pretty sure the device is completely turned off, since its configuration space cannot be accessed during the reset. I'd expect the kernel to call the error detection callback as soon as the first memory read request to the device fails or times out. Is my assumption correct?
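Part of what lspci prints comes from the standard configuration header: vendor ID at offset 0x00, device ID at 0x02 and revision ID at 0x08, all little-endian. A function that does not complete a config read returns all ones, which is consistent with the "rev ff" above. A small sketch that decodes these fields from the sysfs config file (the path, struct and function names are assumptions for illustration):

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Selected fields of the standard PCI configuration header. */
struct pci_ident {
    uint16_t vendor;   /* offset 0x00 */
    uint16_t device;   /* offset 0x02 */
    uint8_t  revision; /* offset 0x08 */
};

/* Decode the first 9 bytes of configuration space (little-endian). */
static struct pci_ident pci_decode_ident(const uint8_t cfg[9])
{
    struct pci_ident id;
    id.vendor   = (uint16_t)(cfg[0] | (cfg[1] << 8));
    id.device   = (uint16_t)(cfg[2] | (cfg[3] << 8));
    id.revision = cfg[8];
    return id;
}

/* An unanswered config read yields all ones; 0xFFFF is never a valid
 * vendor ID, so it marks a missing or unresponsive function. */
static int pci_ident_absent(const struct pci_ident *id)
{
    return id->vendor == 0xFFFF;
}

/* Read the header from a file such as
 * /sys/bus/pci/devices/0000:01:00.0/config (assumed path).
 * Returns 0 on success, -1 on error. */
static int pci_read_ident(const char *path, struct pci_ident *out)
{
    uint8_t cfg[9];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, cfg, sizeof cfg, 0);
    close(fd);
    if (n != (ssize_t)sizeof cfg)
        return -1;
    *out = pci_decode_ident(cfg);
    return 0;
}
```

The vendor ID is the more reliable marker: 0xFFFF is reserved and never assigned, whereas a revision byte of 0xFF could in principle be genuine.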

ApiTiger
  • Do you enable error handling in your driver init routine? – stark Mar 24 '20 at 20:14
  • @stark I do not explicitly enable any error handling in the init function. I just register the mydevice_driver struct shown above with the kernel; it contains the error handlers. – ApiTiger Mar 24 '20 at 21:03
  • You have a few issues there, AFAICT. First, the error handler is about something different. That's why it's the driver developer's responsibility to test device presence before **each critical I/O** to the device! Second, your device behaves badly (according to the spec): a turned-off device shouldn't provide its vendor/device ID (yours looks like it's based on some FPGA with faulty PCI logic). Third, accessing device memory like this is not recommended in general. And if you need to, you have to take care of all possible corner cases (high-bandwidth NIC drivers with DPDK may serve as an example). – 0andriy Mar 25 '20 at 17:12
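Checking presence before critical I/O, as suggested in the comment above, can be sketched without slowing down the common path: only a read that returns all ones triggers a presence check, because 0xFFFFFFFF may also be a legitimate register value. The presence hook, the enum, and presence_from_flag (a stand-in used for demonstration) are hypothetical; inside a driver, the kernel's pci_device_is_present() helper performs a comparable vendor-ID check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hook that reports whether the device still answers, e.g. by reading
 * its vendor ID from config space (hypothetical). Nonzero = present. */
typedef int (*presence_fn)(void *ctx);

enum reg_status { REG_OK = 0, REG_DEVICE_GONE = -1 };

/* Read one 32-bit register. Only an all-ones value, which can be either
 * real data or a failed read, pays for the extra presence check.
 * On REG_DEVICE_GONE, *out is left unchanged. */
static enum reg_status read_reg_checked(volatile const uint32_t *bar, size_t idx,
                                        uint32_t *out, presence_fn present, void *ctx)
{
    uint32_t v = bar[idx];
    if (v == UINT32_MAX && !present(ctx))
        return REG_DEVICE_GONE;
    *out = v;
    return REG_OK;
}

/* Demonstration stand-in: ctx points to an int flag (1 = present). */
static int presence_from_flag(void *ctx)
{
    return *(const int *)ctx;
}
```

The check is inherently racy: the device can disappear right after a successful read, so it flags untrustworthy data rather than preventing the failed access.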
