
I am trying to optimize reading data over PCIe via mmap. We have some tools that allow reading/writing one word over the PCIe link at a time, but I would like to read/write as many words as required in one request.

My project uses PCIe Gen3 with AXI bridges (2 PCIe BARs).

I can successfully read any word from the bus but I notice a pattern when requesting data:

  • request data at address 0: one AXI master read request for 4 addresses of data, starting address 0
  • request data at addresses 0 and 1: two AXI read requests: the first is the same as above, followed by a read request for 3 addresses of data, starting address 1
  • request data at addresses 0 to 2: three AXI read requests: the first two are the same as above, followed by a read request for 2 addresses of data, starting address 2

The pattern continues until the address is a multiple of 4. It seems that if I request the first address, the AXI master fetches the first 4 values. Any hints? Could this be caused by the driver that I am using?

Here's how I use mmap:

    length_offset = tmp_offset_rw & ~(sysconf(_SC_PAGESIZE) - 1);
    mmap_offset = (u_long)(tmp_barx_rw << 12) + length_offset;
    mmap_len = (u_long)(tmp_size * sizeof(int));
    mmap_address = mmap(NULL, mmap_len + (int)(tmp_offset_rw) - length_offset,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_offset);

    close(fd);
    // tmp_reg_buf = new u_int[tmp_size];
    // memcpy(tmp_reg_buf, mmap_address, tmp_size * sizeof(int));

    // for (int i = 0; i < 4; i++)
    //   printf("0x%08X\n", tmp_reg_buf[i]);

    for (int i = 0; i < tmp_size; i++)
      printf("0x%08X\n", *((u_int *)mmap_address + (int)tmp_offset_rw - length_offset + i));

r0b0t1
  • I do not understand the wording of the question. How many bytes are you trying to read in each of these cases, and what is the alignment of the addresses? Please post some of the code that you are using to read from the memory mapped region. – Jamey Hicks Jul 30 '20 at 18:14
  • Thanks for the reply. What I am trying to say is that I would like to read multiple integers from the PCIe memory with just one request. It seems the kernel only supports single read/write transactions; could that be the case? Even though it clearly sends data requests which are multiples of 4. – r0b0t1 Jul 30 '20 at 21:37
  • Your code explicitly says to read 4 bytes at a time, so why would you expect something different? Have you tried using an instruction that reads more than 4 bytes? – prl Jul 31 '20 at 09:45
  • The behavior you describe sounds like the behavior of memcpy (which you have commented out in your example code). Memcpy is defined to perform byte accesses, and isn’t really suitable for accessing MMIO. It would be easier to answer your question if your description matches your code. – prl Jul 31 '20 at 09:48
  • The kernel isn’t involved in these accesses. It’s just your software and the hardware. – prl Jul 31 '20 at 09:51
  • Thank you for the replies. I am sorry my explanation/code does not reflect what I am trying to say; I will try to phrase it better. PCIe allows you to read multiple words in one TLP; as far as I understand, one could use mmap to request, say, 8 words in one single instruction. With this code I was expecting to see a single AXI word request to the PCIe every time I read an integer, but this is not the case: every time I request data (be it char, int, long) there is always a TLP of size 4 words being sent to the FPGA. – r0b0t1 Jul 31 '20 at 16:11

1 Answer


First off, the driver just sets up the mapping between application virtual addresses and physical addresses, but is not involved in requests between the CPU and the FPGA.

PCIe memory regions are typically mapped in uncached fashion, so the memory requests you see in the FPGA correspond exactly to the width of the values the CPU is reading or writing.

If you disassemble the code you have written, you will see load and store instructions operating on different widths of data. Depending on the CPU architecture, load/store instructions requesting wider data widths may have address alignment restrictions, or there may be performance penalties for fetching unaligned data.
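
For example, here is a minimal sketch (the function name is mine and purely illustrative; it assumes the pointer passed in is the one returned by your mmap() call and that the BAR is mapped uncached) showing how the width of the access in C determines the width of the load instruction, and therefore the size of the read request that reaches the FPGA:

    #include <stdint.h>

    /* Sketch only: `bar` is assumed to be the pointer returned by mmap()
     * on the BAR. With an uncached mapping, each dereference below becomes
     * one read request of the corresponding size. */
    void read_widths(volatile void *bar)
    {
        volatile uint8_t  *p8  = (volatile uint8_t  *)bar;
        volatile uint32_t *p32 = (volatile uint32_t *)bar;
        volatile uint64_t *p64 = (volatile uint64_t *)bar;

        uint8_t  b = *p8;   /* 1-byte load -> 1-byte read            */
        uint32_t w = *p32;  /* 4-byte load -> 4-byte read            */
        uint64_t d = *p64;  /* 8-byte load -> 8-byte read, provided
                               the address is 8-byte aligned         */

        (void)b; (void)w; (void)d;  /* silence unused-variable warnings */
    }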

Different memcpy() implementations often have special cases so that they can use the fewest possible instructions to transfer a given amount of data.

The reason why memcpy() may not be suitable for MMIO is that memcpy() may read more memory locations than specified in order to use larger transfer sizes. If the MMIO memory locations cause side effects on read, this could cause problems. If you're exposing something that behaves like memory, it is OK to use memcpy() with MMIO.
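
If you do want to copy a block of registers out of the mapped region, one common alternative to memcpy() is a loop of explicit, fixed-width volatile accesses, so that the access pattern on the bus is predictable. A sketch (the function name and signature are mine, not part of any particular API):

    #include <stddef.h>
    #include <stdint.h>

    /* Copy `nwords` 32-bit words out of an MMIO region using one aligned
     * 4-byte read per word; no wider or narrower accesses are generated. */
    static void mmio_read_words(uint32_t *dst, const volatile uint32_t *src,
                                size_t nwords)
    {
        for (size_t i = 0; i < nwords; i++)
            dst[i] = src[i];
    }

In your example this could be called as something like mmio_read_words(tmp_reg_buf, (const volatile uint32_t *)mmap_address, tmp_size), after allocating tmp_reg_buf.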

If you want higher performance and there is a DMA engine available on the host side of PCIe, or you can include a DMA engine in the FPGA, then you can arrange for transfers up to the limits imposed by the PCIe protocol, the BIOS, and the configuration of the PCIe endpoint on the FPGA. DMA is the way to maximize throughput, with bursts of 128 or 256 bytes commonly available.

The next problem that needs to be addressed to maximize throughput is latency, which can be quite long. DMA engines need to be able to pipeline requests in order to mask the latency from the FPGA to the memory system and back.

Jamey Hicks