
Our team is currently working on a custom device.
There is a Cyclone V board with a COM Express amd64-based PC plugged into it. The board works as a PCIe native endpoint. It powers on first and then switches the PC on. The PC runs Linux with kernel 4.10, plus some drivers and software that work with PCI BAR0 via MMIO.
The system works flawlessly until the first reboot from the Linux terminal. On the next boot MMIO read access is broken, while MMIO write access is OK.

Let's say there are two offsets to read, A and B, holding the values 0xa and 0xb respectively. If we read bytes from these offsets, the returned values appear to be delayed by 8 read operations:

  1. read A ten times - returns 0xa every time
  2. read B eight times - returns 0xa every time
  3. read B ten times - returns 0xb every time
  4. read A once - returns 0xb
  5. read B seven times - returns 0xb every time
  6. read B once - returns 0xa
  7. read B ten times - returns 0xb

When offsets A and B are within the same 64-bit word, everything works as expected.
MMIO access is done via the readb/readw/readl/readq functions; which function is used does not affect this delay at all.
Subsequent reboots may fix MMIO reads or break them again.
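The numbers behave exactly as if every read completion arrives eight requests late, i.e. as if eight stale completions sit queued somewhere between the CPU and the device logic. Purely as an illustration (the queue and its depth of 8 are a hypothesis, not something we have observed directly), a tiny C model of such a stuck-completion queue reproduces the sequence above:

```c
#include <stdint.h>

#define DEPTH 8  /* hypothetical number of stale completions in flight */

/* Model: each read request pushes the value the device really drives,
 * but the completion actually delivered is the one generated DEPTH
 * requests earlier -- as if DEPTH completions were stuck in a FIFO. */
struct stale_fifo {
    uint32_t slot[DEPTH];
    unsigned head;
};

static void fifo_seed(struct stale_fifo *f, uint32_t v)
{
    for (unsigned i = 0; i < DEPTH; i++)
        f->slot[i] = v;
    f->head = 0;
}

/* 'actual' is the value the device really returns for this read;
 * the caller instead receives the completion queued DEPTH reads ago. */
static uint32_t fifo_read(struct stale_fifo *f, uint32_t actual)
{
    uint32_t stale = f->slot[f->head];
    f->slot[f->head] = actual;          /* this completion gets stuck in turn */
    f->head = (f->head + 1) % DEPTH;
    return stale;
}
```

Seeded with eight 0xa completions, the model returns 0xa for the first eight reads of B and only then starts returning 0xb, matching steps 1-4 of the sequence above.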

From the Linux point of view, mmiotrace gives the same picture with broken data.
From the device point of view, the SignalTap logic analyzer shows valid data values on the PCIe core bus.
We have no PCIe bus analyzer, so we know of no way to check the data exchange between those two points.
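As an additional check we can bypass the driver and read BAR0 from user space via sysfs. A minimal sketch (the helper itself is generic; the sysfs path it is pointed at below is hypothetical and has to be replaced with the real bus address of our endpoint):

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the first page of 'path' and perform a single volatile 32-bit
 * read at offset 'off'.  Returns 0xffffffff on error, the same value a
 * failed PCIe read would produce. */
static uint32_t read32_mapped(const char *path, off_t off)
{
    uint32_t v = 0xffffffffu;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return v;
    void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (p != MAP_FAILED) {
        /* volatile keeps the compiler from caching or merging the read */
        v = *(volatile uint32_t *)((uint8_t *)p + off);
        munmap(p, 4096);
    }
    close(fd);
    return v;
}
```

Calling it as read32_mapped("/sys/bus/pci/devices/0000:01:00.0/resource0", 0x0) (the device address here is assumed for illustration) should show whether the eight-read delay is visible from user space too, i.e. whether it happens below the driver.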
What can be the reason for such behaviour, and how can it be fixed?

yurimz
  • You need to mark the resource as an uncacheable region. As it is, you are just reading CPU cache lines instead of doing real bus reads. – 0andriy May 29 '18 at 16:23
  • @0andriy, thank you for your response. As far as I can see from the pci_resource_flags() call result, the IORESOURCE_CACHEABLE flag is not set. – yurimz May 30 '18 at 14:51
  • That's the point of view of the bus; what I'm talking about is the CPU's point of view of the same region. For example, x86 has a set of MTRR registers that define the attributes of memory areas. – 0andriy May 30 '18 at 16:55
  • @0andriy, it seems that's not the case. pcim_iomap() uses ioremap_nocache() internally, which gives _PAGE_CACHE_MODE_UC_MINUS by default. Forcing ioremap_uc() did not change anything. Even 5-minute delays between reads change nothing. MMIO writes are not affected and always arrive in order. On the first boot after a complete power-off the system works flawlessly. – yurimz May 31 '18 at 14:09
  • Is it always a delay of 8? – alex.forencich Jun 06 '18 at 09:19
  • And how are you performing the reads within the FPGA? What PCIe core? What internal interface? And what component are the reads targeting? Could the internal read requests and responses be getting out of sync somehow, with responses stuck in a FIFO? – alex.forencich Jun 06 '18 at 09:21
  • Rather, a bunch of extra read responses, possibly stuck in an internal FIFO inside the PCIe core? – alex.forencich Jun 06 '18 at 09:23
  • @alex.forencich, yes, the delay is always 8 operations. Moreover, if we perform reads of different widths, we retrieve the results with a delay of 8 operations, but with the correct width: reading a 16-bit word always returns a 16-bit word. Both SignalTap and mmiotrace within the Linux kernel show a 16-bit value no matter what width was used 8 operations before. Regarding your questions on the FPGA part of the problem, I am not able to answer them right now, as a colleague of mine is on vacation. I will answer as soon as he returns. – yurimz Jun 07 '18 at 14:19

0 Answers