
I've been programming a Linux kernel module for a PCIe device for several years. One of its main features is transferring data from the PCIe card to host memory using DMA.

I'm using streaming DMA, i.e. the user program allocates the memory, and my kernel module does the job of pinning the pages and creating the scatter-gather structure. This works correctly.
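For context, that step looks roughly like the sketch below (names such as `pin_user_buffer()` are illustrative, not my actual driver code; recent kernels provide `pin_user_pages_fast()` for the pinning):

```c
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/*
 * Illustrative sketch only: pin a user buffer and build a scatter-gather
 * table for it. Names and error handling are simplified.
 */
static int pin_user_buffer(unsigned long uaddr, size_t len,
                           struct page ***pages_out, struct sg_table *sgt)
{
        unsigned int nr_pages = DIV_ROUND_UP(offset_in_page(uaddr) + len, PAGE_SIZE);
        struct page **pages;
        int pinned, ret;

        pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* Lock the user pages in memory so the device can DMA into them. */
        pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
        if (pinned < 0) {
                ret = pinned;
                goto err_free;
        }
        if (pinned != nr_pages) {
                ret = -EFAULT;
                goto err_unpin;
        }

        /* Build one scatterlist entry per contiguous run of pages. */
        ret = sg_alloc_table_from_pages(sgt, pages, nr_pages,
                                        offset_in_page(uaddr), len, GFP_KERNEL);
        if (ret)
                goto err_unpin;

        *pages_out = pages;
        return 0;

err_unpin:
        unpin_user_pages(pages, pinned);
err_free:
        kvfree(pages);
        return ret;
}
```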

However, on some more recent hardware with Intel processors, the calls to dma_map_page() and dma_unmap_page() take much longer to execute.

I've tried using dma_map_sg() and dma_unmap_sg() instead; they take approximately the same, longer time.
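In both variants, the mapping around a transfer looks roughly like this sketch (`start_transfer_and_wait()` is a placeholder for the device-specific part, not a real API):

```c
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Device-specific transfer routine; placeholder name for this sketch. */
static void start_transfer_and_wait(struct device *dev,
                                    struct scatterlist *sgl, int nents);

/* Sketch: map the scatter-gather list for device DMA, run the transfer,
 * then unmap. On the affected machines, the slow part is the map/unmap,
 * not the transfer itself. */
static int do_dma_transfer(struct device *dev, struct sg_table *sgt)
{
        int nents;

        nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
        if (nents <= 0)
                return -EIO;

        /* Program the card's descriptors from sgt->sgl and wait for
         * completion (polling or interrupt). */
        start_transfer_and_wait(dev, sgt->sgl, nents);

        dma_unmap_sg(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
        return 0;
}
```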

I've tried splitting the dma_unmap_sg() into a first call to dma_sync_sg_for_cpu(), followed by a call to dma_unmap_sg_attrs() with the DMA_ATTR_SKIP_CPU_SYNC attribute. This works correctly, and I can see that the additional time is spent in the unmap operation, not in the sync.
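The split roughly looks like this (again a sketch, not the exact driver code):

```c
#include <linux/dma-mapping.h>

/* Sketch: split the unmap into an explicit CPU sync followed by an unmap
 * that skips the sync, to see where the time goes. */
static void split_unmap(struct device *dev, struct sg_table *sgt)
{
        /* Make the DMA'd data visible to the CPU: this part is fast. */
        dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);

        /* Tear down the mapping without syncing again: this is where the
         * extra time is spent on the affected machines. */
        dma_unmap_sg_attrs(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE,
                           DMA_ATTR_SKIP_CPU_SYNC);
}
```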

I've tried playing with the Linux kernel command-line parameters related to the IOMMU (on, force, strict=0), and also intel_iommu, with no change in behavior.
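Roughly, these are the combinations I tried (one or two per boot; none of them changed the timing):

```
iommu=on
iommu=force
iommu=off
iommu.strict=0
intel_iommu=on
intel_iommu=off
```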

Some other hardware shows a decent transfer rate, i.e. more than 6 GB/s on PCIe3 x8 (max 8 GB/s).
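For reference, the raw ceiling I'm comparing against works out roughly as:

```
PCIe 3.0: 8 GT/s per lane with 128b/130b encoding
  8 * 128/130 ≈ 7.88 Gbit/s ≈ 0.985 GB/s per lane
  x8 lanes    ≈ 7.88 GB/s raw, so roughly 6.5-7 GB/s of payload after TLP overhead
```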

On some recent hardware, the issue limits the transfer rate to ~3 GB/s. (I've checked that the card is correctly configured for PCIe3 x8, and the programmer of the Windows device driver manages to achieve the 6 GB/s on the same system. Things are more hidden behind the curtain on Windows, and I cannot get much information from it.)

On some hardware, the behavior is either normal or slowed down depending on the Linux distribution (and, I guess, the Linux kernel version). On other hardware the roles are reversed, i.e. the slow one becomes the fast one and vice versa.

I cannot figure out the cause of this. Any clue?

  • What is in your dma_ops unmap_sg code? – stark Sep 13 '22 at 19:06
  • In both cases do you have IOMMU enabled? – 0andriy Sep 14 '22 at 06:20
  • @stark I did not define any code in the dma_ops unmap_sg. – Didier Trosset Sep 14 '22 at 06:26
  • @0andriy I tried with iommu=off, iommu=force, intel_iommu=on, and intel_iommu=off. – Didier Trosset Sep 14 '22 at 06:27
  • Investigate `dmesg` output in all cases; it might be that bounce buffers come into play. – 0andriy Sep 14 '22 at 06:31
  • I've timed the operations (see the rough timing sketch after these comments). Between the register write to the card that actually starts the DMA transfer and the actual end of the transfer, detected by either polling or interrupt, the timing shows that data flows at more than 6GB/s. It's really the calls to `dma_map` and `dma_unmap` that take more time than the transfer itself! – Didier Trosset Sep 14 '22 at 06:56
  • Thanks @0andriy. Looks like bounce buffers are taking place. Didn't know they existed. – Didier Trosset Sep 14 '22 at 20:20
  • You could try `intel_iommu=on iommu=pt` (`pt` means "pass-through"). It probably won't make any difference as it's mostly for use when host PCI resources are used by virtual machines, but it shouldn't make things any worse (fingers crossed). – Ian Abbott Sep 16 '22 at 15:42
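For reference, this is roughly how the map/transfer/unmap phases were timed separately from inside the module (a sketch; `start_transfer_and_wait()` is again a placeholder for the device-specific transfer code):

```c
#include <linux/ktime.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Device-specific transfer routine; placeholder name for this sketch. */
static void start_transfer_and_wait(struct device *dev,
                                    struct scatterlist *sgl, int nents);

/* Sketch: measure the map, transfer, and unmap phases independently. */
static void timed_transfer(struct device *dev, struct sg_table *sgt)
{
        ktime_t t0, t1, t2, t3;
        int nents;

        t0 = ktime_get();
        nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
        t1 = ktime_get();

        /* Start the DMA on the card and wait for completion. */
        start_transfer_and_wait(dev, sgt->sgl, nents);
        t2 = ktime_get();

        dma_unmap_sg(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
        t3 = ktime_get();

        dev_info(dev, "map %lld us, xfer %lld us, unmap %lld us\n",
                 ktime_us_delta(t1, t0), ktime_us_delta(t2, t1),
                 ktime_us_delta(t3, t2));
}
```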

0 Answers