
I have an NVIDIA Tegra TK1 processor module on a carrier board with a PCIe slot connected to it. In that slot sits an FPGA board which exposes some registers and a 64K memory area via PCIe.

On the ARM CPU of the Tegra board, a minimal Linux installation is running.

I am using /dev/mem and the mmap function to obtain user-space pointers to the register structs and the 64K memory area. The distinct register files and the memory block are all assigned addresses which are page-aligned and do not share 4KB memory pages with one another. I explicitly map whole pages with mmap, using the result of getpagesize(), which is also 4096.

I can read from and write to those exposed registers just fine. I can also read from the 64KB memory area, doing uint32 word-by-word reads in a for loop, just fine; i.e. the contents read back are correct.

If I use std::memcpy on the same address range, however, the Tegra CPU always freezes. I do not see any error message; with GDB attached, I also don't see a thing in Eclipse when trying to step over the memcpy line, it just stops hard. And I have to reset the CPU using the hardware reset button, as the remote console is frozen.

This is a debug build with no optimization (-O0), using gcc-linaro-6.3.1-2017.05-i686-mingw32_arm-linux-gnueabihf. I was told the 64K region is accessible byte-wise; I did not try that explicitly.

Is there an actual (potential) problem that I need to worry about, or is there a specific reason why memcpy does not work and maybe should not be used in the first place in this scenario - and I can just carry on using my for loops and think nothing of it?

EDIT: Another effect has been observed: the original code snippet was missing a "vital" printf in the copying for loop, which came before the memory read. With that removed, I don't get valid data back. I have now updated the code snippet to have an extra read from the same address instead of the printf, which also yields correct data. The confusion intensifies.

Here are the (I think) important excerpts of what's going on, with minor modifications so they make sense in this "de-fluffed" form.

// void* physicalAddr: PCIe "BAR0" address as reported by dmesg, added to the physical address offset of FPGA memory region
// long size: size of the physical region to be mapped 

//--------------------------------
// doing the memory mapping
//

const uint32_t pageSize = getpagesize();
assert( IsPowerOfTwo( pageSize ) );

const uintptr_t physAddrNum = (uintptr_t) physicalAddr; // uintptr_t keeps the cast valid on 64-bit builds too
const uint32_t offsetInPage = physAddrNum & (pageSize - 1);
const uint32_t firstMappedPageIdx = physAddrNum / pageSize;
const uint32_t lastMappedPageIdx = (physAddrNum + size - 1) / pageSize;
const uint32_t mappedPagesCount = 1 + lastMappedPageIdx - firstMappedPageIdx;
const uint32_t mappedSize = mappedPagesCount * pageSize;
const off_t targetOffset = physAddrNum & ~(off_t)(pageSize - 1);

m_fileID = open( "/dev/mem", O_RDWR | O_SYNC );
assert( m_fileID >= 0 ); // open can fail, e.g. without root privileges
// Passing null as addr lets the kernel pick where to map; a non-null addr would only be taken as a "hint" for placement.
void* mapAtPageStart = mmap( 0, mappedSize, PROT_READ | PROT_WRITE, MAP_SHARED, m_fileID, targetOffset );
if (MAP_FAILED != mapAtPageStart)
{
    m_userSpaceMappedAddr = (volatile void*) ( (uintptr_t)mapAtPageStart + offsetInPage );
}

//--------------------------------
// Accessing the mapped memory
//

//void* m_rawData: <== m_userSpaceMappedAddr
//uint32_t* destination: points to a stack object
//int length: size in 32bit words of the stack object (a struct with only U32's in it)

// this crashes:
std::memcpy( destination, m_rawData, length * sizeof(uint32_t) );

// this does not, AND does yield correct memory contents - but only with a preceding extra read
for (int i=0; i<length; ++i)
{
    // This extra read makes the data gotten in the 2nd read below valid.
    // Commented out, the data read into destination will not be valid.
    uint32_t tmp = ((const volatile uint32_t*)m_rawData)[i];
    (void)tmp; //pacify compiler

    destination[i] = ((const volatile uint32_t*)m_rawData)[i];
}
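
For completeness, a byte-wise variant of the same loop (a sketch, untested; byte-wise access was not tried at the time of writing, but see the comments below, where it is later reported to work):

const volatile uint8_t* byteSrc = (const volatile uint8_t*) m_rawData;
uint8_t* byteDst = (uint8_t*) destination;
for (int i = 0; i < length * (int)sizeof(uint32_t); ++i)
{
    // one 8-bit access per iteration; the volatile source keeps the
    // compiler from widening or merging the loads
    byteDst[i] = byteSrc[i];
}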
sktpin
    It would be nice if you could show some code of what you are doing. – Qubit Sep 04 '18 at 09:03
  • Welcome to stackoverflow.com. Please take some time to read [the help pages](http://stackoverflow.com/help), take [the SO tour](http://stackoverflow.com/tour), read about [how to ask good questions](http://stackoverflow.com/help/how-to-ask), and read [this question checklist](https://codeblog.jonskeet.uk/2012/11/24/stack-overflow-question-checklist/). Lastly please learn how to create a [Minimal, Complete, and Verifiable Example](http://stackoverflow.com/help/mcve). – Some programmer dude Sep 04 '18 at 09:04
  • What happens, if you put a breakpoint into `memcpy`, and execute it instruction by instruction? Does it crash? If yes, what is the instruction, and register values it crashes on? – geza Sep 04 '18 at 09:25
  • You are not necessarily able to access memory mapped to external hardware with arbitrary assembly instructions. Sometimes 32-bit-wide access is OK and byte-wide or 64-bit-wide access is not. You need to understand your hardware's limitations. I used to work with such a device; only DWORD instructions gave correct results with it. – n. m. could be an AI Sep 04 '18 at 09:48
  • geza: I instruction-stepped now for 10 minutes in memcpy and 2 of my fingers are tired for now ;) (one for pressing step, one for pressing print) It does a lot of stuff and jumps around without crashing, so I don't know yet at which instruction precisely it will finally crash. – sktpin Sep 04 '18 at 09:52
  • n.m.: interesting, so I guess I'd need to know what kind of funky optimized instructions memcpy uses on that target/compiler, that may not agree with the hardware. Does make sense! – sktpin Sep 04 '18 at 09:54
  • You can script gdb with Python and avoid finger fatigue. – n. m. could be an AI Sep 04 '18 at 09:55
  • @sktpin: With gdb you can easily create a loop. But supposedly it won't crash if it hasn't crashed yet (I mean, presumably you have stepped through enough instructions to execute the main loop of memcpy several times). – geza Sep 04 '18 at 09:57
  • @sktpin: maybe it's a HW related issue. Sends memory write too fast, or something like that. What happens, if you use `-O2`? Does it crash with the manual loop version? – geza Sep 04 '18 at 09:58
  • This is just a shot in the dark, but have you checked if the destination and source memory overlaps? Does `std::memmove` work? – Max Vollmer Sep 04 '18 at 10:21
  • @Max Vollmer: they are about 128MB apart – sktpin Sep 04 '18 at 10:33
  • @geza: Now that's interesting! Haven't tried -O2 yet, but it turns out that in the copying for loop I had a printf (not shown in my snippet), which turned out to be vital. Commenting that out, I don't get real data back. Now, instead of that printf, which is a comparably heavy operation, I just put a line that reads a 32bit word into a temp variable, then the read from the same address is performed into the real destination. That also works... I'll update my code snippet. – sktpin Sep 04 '18 at 10:36
  • @geza: Does not crash with -O2, with or without the preceding read from the same address into a temp variable. The effect that I only get valid data with the preceding extra read holds for both -O0 and -O2. – sktpin Sep 04 '18 at 10:52
  • @geza: Now I changed the loop to do byte-wise access instead of 32bit-word-wise. Using -O2 or -O0, the data seems to check out, without the funny extra read, and with no crash in my for loop. It seems that thing isn't really up to 32bit reading. The registers, though, which are on a different one of the multiple PCIe address "BARs" reported in dmesg, do read fine 32bit at a time, without funny effects seen so far. – sktpin Sep 04 '18 at 11:20
  • @sktpin: isn't there a documentation on this? Maybe it's written somewhere, you don't have to find these out the hard way... – geza Sep 04 '18 at 11:29
  • @geza the FPGA content is done by someone else, he is not aware of a reason why these things should be happening. He's never used PCI-e before either, though. I'd find it weird to see anything produced in the past decades, using such buses, not working with 32bit accesses. So I interpret what I'm seeing as a side effect of something else. (probably that thing isn't even really doing byte wise accesses... but currently I don't know of a way to test that) – sktpin Sep 04 '18 at 13:19
  • @sktpin: I see. As your issue is very likely HW related, I don't think that SO can solve it. Anyways, if you find out the root cause of your problem, please share it, I'm curious :) – geza Sep 04 '18 at 13:38
  • I'm not sure, but shouldn't you be mapping via `/sys/bus/pci/devices/B:D:F/resource0` instead? You're sort of bypassing any PCI-e specific handling for the mapping here. – Hasturkun Sep 05 '18 at 06:55
  • @Hasturkun There is no such dir. And wouldn't that be pcie? I see this, under "/sys/bus/pci_express/devices/": 0000:00:00.0:pcie01 0000:00:00.0:pcie02. But I couldn't map those devices, or can I? The reason for mapping the device addresses via /dev/mem was to not have to write device drivers, it's all prototyping / proof of concept right now and I'm looking for the least time intensive ways of getting stuff to work. I never wrote a kernel driver, and from what I read so far, it can be quite a nightmare to debug. Not that it's a lot better right now, or perhaps it is - I do not know ;) – sktpin Sep 05 '18 at 13:34
  • Does the actual device directory not have a `resource0` or `resource0_wc` file? These should AFAICT be provided by the kernel's built-in PCI support (and allow you to mmap the resources), and I think it works similarly for PCI-e. Check out the [kernel's sysfs-pci](https://www.kernel.org/doc/Documentation/filesystems/sysfs-pci.txt) documentation for details; a sketch of this approach follows after the comments. – Hasturkun Sep 05 '18 at 13:40
  • @geza Maybe ;) I thought there could have been a clear answer for this, some way memcpy works that's not compatible with this scenario. Meanwhile I also tried poking that memory with busybox's devmem command; it yields the same result: only every 2nd read from an address is correct. It does not crash. So at least it's not a ghost in my program, there really is a problem ;) – sktpin Sep 05 '18 at 13:51
  • `memcpy` doesn't work with volatile objects; if the source could be modified during the operation then all bets are off – M.M Sep 20 '18 at 00:45
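
Following up on Hasturkun's suggestion, here is a minimal, hypothetical sketch of mapping BAR0 through sysfs instead of /dev/mem (0000:01:00.0 is a placeholder for the real bus:device.function from lspci/dmesg; mappedSize as in the snippet above). Per the kernel's sysfs-pci documentation, the mmap offset is relative to the start of the BAR, so no physical-address arithmetic is needed:

#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>

int fd = open( "/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR );
assert( fd >= 0 );
// offset 0 means "start of BAR0", not a physical address
void* bar0 = mmap( 0, mappedSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
if (bar0 != MAP_FAILED)
{
    volatile uint32_t* regs = (volatile uint32_t*) bar0;
    uint32_t value = regs[0]; // 32-bit read at BAR0 + 0
    (void)value;
}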

1 Answer


Based on the description, it looks like your FPGA logic is not responding correctly to load instructions that read from locations on your FPGA, and that is causing the CPU to lock up. It's not crashing; it is permanently stalled, hence the need for the hard reset. I had this problem too when debugging my PCIe logic on an FPGA.

Another indication that your logic is not responding correctly is that you need an extra read in order to get the right responses.

Your loop is doing 32-bit loads, but memcpy is doing at least 64-bit loads, which changes how your logic responds. For example, it may need to use two TLPs for one response, with 32 bits of the data in the first 128-bit word of the completion and the next 32 bits in the second 128-bit word, if the 64-bit load straddles that boundary.
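
As an illustrative aside (not from the original answer; copy_from_io32 is a made-up name): if only 32-bit accesses are safe, one way to make that explicit is a small copy helper that pins the access width, since volatile 32-bit reads cannot legally be merged into wider loads:

#include <stddef.h>
#include <stdint.h>

// One 32-bit load per element; the volatile source keeps the compiler from
// fusing these reads into wider LDRD/LDM or NEON loads the way memcpy might.
static void copy_from_io32( uint32_t* dst, const volatile uint32_t* src, size_t words )
{
    for (size_t i = 0; i < words; ++i)
        dst[i] = src[i];
}

Called as copy_from_io32( destination, (const volatile uint32_t*) m_rawData, length ), this matches the questioner's working loop, just packaged so the access width is guaranteed by the types.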

What I found super-useful was to add logic to log all the PCIe transactions into an SRAM and to be able to dump the SRAM out to see how the logic was behaving or misbehaving. We have a nifty utility, pcieflat, that prints one PCIe TLP per line. It even has documentation.

When the PCIe interface is not working well enough, I stream the log to a UART in hex, which can be decoded by pcieflat.

This tool is also useful for debugging performance problems -- you can look at how well your DMA reads and writes are pipelined.

Alternatively, if you have an integrated logic analyzer or similar on the FPGA, you can trace the activity that way. But it's nicer to have the TLPs parsed according to the PCIe protocol.

Jamey Hicks
  • The "needs two reads from same address to get 1 data item" issue has been eliminated, but memcpy still "crashes". Reading a 64K buffer 32bit wise in a for loop, I get ~ 2MB/s, far too slow for our needs. That's not using DMA yet, I'm just looking into that. There is a project on github called "udmabuf" which allows to allocate a kernel buffer, for the DMA to write into, and userspace app to read out of, let's see how that goes. Anyway, I will forward the info you gave, esp. the debugging tips, thanks! – sktpin Sep 20 '18 at 08:20
  • Btw, I had read somewhere else that for a PCIe BAR, the prefetchable bit needs to be set in order for memcpy to work. So the FPGA guy found a way to enable that. Then dmesg did show that BAR to be prefetchable, but memcpy still crashes. – sktpin Sep 20 '18 at 15:22
  • Our `portalmem` driver also enables allocation of DMA buffers via ioctl so that you can allocate or free as many as you want. We implemented a simple MMU for the FPGA so that we can have large buffers without requiring contiguous memory. https://github.com/cambridgehackers/connectal/tree/master/drivers/portalmem – Jamey Hicks Sep 20 '18 at 16:13
  • What is your performance requirement? How many lanes does your PCIe connection have? It's likely you will need read/write DRAM from the FPGA in order to get high bandwidth transfers. You need quite a few requests in flight at a time to get close to the maximum bandwidth of PCIe. – Jamey Hicks Sep 20 '18 at 16:15
  • Somewhat under 50 MB/s. The current eval board has 4 lanes, another has only one. – sktpin Sep 21 '18 at 10:21
  • Even the 1x board should be able to easily deliver 50MB/s. I think you should be able to transfer 400MB/s (80% of PCIe gen2 1x). – Jamey Hicks Sep 21 '18 at 13:24
  • I tried the Xilinx-provided "xdma" samples now; their page says it's for x86 only. I cross-compiled their kernel module against "my" kernel source code for ARMhf, et voilà, it seems to work, funny. Using one of their little programs in the /test folder of that example (Xilinx Answer 65444), dma_from_device.c, modified slightly, I shoveled some data over 20 seconds and got ~220 MB/s. I do not understand every aspect of the code yet; there is probably some overhead involved. Weirdly, the byte order is swapped to big-endian at some place, but the data looks legit otherwise... so I'll probably use that code. – sktpin Sep 24 '18 at 12:38
  • Btw., do you know whether /dev/mem + mmap mappings to device registers are subject to caching? I've read contradictory info about that, and am seeing some weird effects, like a different number in the device's buffer write index depending on whether or not I have a breakpoint in my code (while a long delay at the same position changes nothing). – sktpin Sep 28 '18 at 10:12
  • According to Xilinx_Answer_65444_Linux.pdf, the mem mapping is not cached when opening the /dev/mem device file with flag O_SYNC. I have not found a reference for that anywhere... If that's true, the problem lies elsewhere. – sktpin Oct 09 '18 at 13:50