I've been programming a Linux kernel module for a PCIe device for several years. One of its main features is transferring data from the PCIe card to host memory using DMA.
I'm using streaming DMA, i.e. the user program allocates the memory, and my kernel module does the job of locking the pages and creating the scatter-gather structure. It works correctly.
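To make the setup concrete, here is a minimal sketch of what that path does. The function name, error handling and the `DMA_FROM_DEVICE` direction are my simplifications for this question, not the actual driver code (and on older kernels `get_user_pages_fast()` would be used instead of `pin_user_pages_fast()`):

```c
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Pin a user buffer, build a scatter-gather table and map it for DMA. */
static int map_user_buffer(struct device *dev, unsigned long uaddr,
			   size_t len, struct sg_table *sgt,
			   struct page ***ppages, int *pnpages)
{
	int npages = DIV_ROUND_UP(offset_in_page(uaddr) + len, PAGE_SIZE);
	struct page **pages;
	int pinned, nents, ret;

	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* Lock the user pages in memory (FOLL_WRITE: the device writes to them). */
	pinned = pin_user_pages_fast(uaddr, npages, FOLL_WRITE, pages);
	if (pinned < 0) {
		ret = pinned;
		goto err_free;
	}
	if (pinned != npages) {
		ret = -EFAULT;
		goto err_unpin;
	}

	/* Build the scatter-gather table covering the pinned pages. */
	ret = sg_alloc_table_from_pages(sgt, pages, npages,
					offset_in_page(uaddr), len, GFP_KERNEL);
	if (ret)
		goto err_unpin;

	/* Map for DMA; this is where the DMA API / IOMMU work happens. */
	nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
	if (!nents) {
		ret = -EIO;
		goto err_sg;
	}

	*ppages = pages;
	*pnpages = npages;
	return nents;

err_sg:
	sg_free_table(sgt);
err_unpin:
	unpin_user_pages(pages, pinned);
err_free:
	kvfree(pages);
	return ret;
}
```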
However, when used on some more recent hardware with Intel processors, the calls to dma_map_page() and dma_unmap_page() take much longer to execute.
I've tried using dma_map_sg() and dma_unmap_sg() instead; they take approximately the same, longer time.
I've tried splitting the dma_unmap_sg() into a first call to dma_sync_sg_for_cpu(), followed by a call to dma_unmap_sg_attrs() with the DMA_ATTR_SKIP_CPU_SYNC attribute. It works correctly, and I can see that the additional time is spent in the unmap operation, not in the sync.
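For reference, the split teardown looks roughly like this (the timing code and names are illustrative only; the real driver differs in detail):

```c
#include <linux/ktime.h>
#include <linux/printk.h>
#include <linux/dma-mapping.h>

/* Tear down the mapping in two steps so the slow part can be measured. */
static void unmap_user_buffer(struct device *dev, struct sg_table *sgt)
{
	ktime_t t0, t1, t2;

	t0 = ktime_get();

	/* Make the buffer visible to the CPU: this part is fast. */
	dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
	t1 = ktime_get();

	/* Tear down the mapping without re-syncing: this is where the time goes. */
	dma_unmap_sg_attrs(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE,
			   DMA_ATTR_SKIP_CPU_SYNC);
	t2 = ktime_get();

	pr_info("sync: %lld us, unmap: %lld us\n",
		ktime_us_delta(t1, t0), ktime_us_delta(t2, t1));
}
```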
I've tried playing with the Linux kernel command-line parameters related to the IOMMU (on, force, strict=0), and also intel_iommu, with no change in the behavior.
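Concretely, the command-line variants I tried were along these lines (the exact spellings may have differed slightly between tests):

```
intel_iommu=on
iommu=force
iommu.strict=0
```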
Other hardware shows a decent transfer rate, i.e. more than 6 GB/s on PCIe Gen3 x8 (max ~8 GB/s).
On some recent hardware the issue limits the transfer rate to ~3 GB/s. I've checked that the card is correctly configured for PCIe Gen3 x8, and the programmer of the Windows device driver manages to achieve 6 GB/s on the same system. (Things are more hidden behind the curtains in Windows, so I cannot get much information from it.)
On some hardware, the behavior is either normal or slowed, depending on the Linux distribution (and the Linux kernel version, I guess). On other hardware, the roles are reversed, i.e. the slow distribution becomes the fast one and vice versa.
I cannot figure out the cause of this. Any clue?