
I have a really weird problem I can't figure out; I haven't seen anything this inexplicable in my 30+ years of programming. Clearly I'm doing something wrong, but I can't figure out what, and I can't even find a way around it.

I have a Linux kernel module I've written that implements a block device. It calls out to userspace to supply the data for the block device via ioctl (as in, the userspace program calls the kernel module via an ioctl to get block device requests).

Some technical information on the machines I'm testing on in case it matters:

It runs flawlessly on an Intel Core i7 (an i7-3770K, per the cpuinfo below).

> cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
stepping        : 9
microcode       : 0x21
cpu MHz         : 1798.762
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips        : 7139.44
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processors 1-7 are the same

It runs flawlessly on a Raspberry Pi Zero.

> cat /proc/cpuinfo 
processor       : 0
model name      : ARMv6-compatible processor rev 7 (v6l)
BogoMIPS        : 997.08
Features        : half thumb fastmult vfp edsp java tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xb76
CPU revision    : 7

Hardware        : BCM2835
Revision        : 920093
Serial          : 000000002d5dfda3

It runs flawlessly on a Raspberry Pi 3.

> cat /proc/cpuinfo
processor       : 0
model name      : ARMv7 Processor rev 4 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4

processor       : 1-3 are the same

Hardware        : BCM2835
Revision        : a02082
Serial          : 00000000e8f06b5e
Model           : Raspberry Pi 3 Model B Rev 1.2

But on my Raspberry Pi 4, it does something really weird that I can't explain, that I'm really stumped by, and that I don't know how to fix.

> cat /proc/cpuinfo 
processor       : 0
model name      : ARMv7 Processor rev 3 (v7l)
BogoMIPS        : 270.00
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd08
CPU revision    : 3

Hardware        : BCM2835
Revision        : c03111
Serial          : 10000000b970c9df
Model           : Raspberry Pi 4 Model B Rev 1.1

processor       : 1-3 are the same

So I'm asking for help from somebody who knows more about CPUs, multithreading, cache coherency and memory barriers than I do. Maybe I'm barking up the wrong tree, and if that's the case you can tell me. I'm pretty sure the program is okay; I've written lots of complicated multithreaded programs in my life, I've checked this one over many times, and I've had other people review it as well. This is the first multithreaded kernel module I've written, though, so that's where I'm in new territory.

Here's what's going on:

I register a callback with blk_queue_make_request() that handles the read and write requests; I drop all the other operations, returning an error (though I never actually get anything but read/write):

    log_kern_debug("bio operation is not read or write: %d", operation);
    bio->bi_status = BLK_STS_MEDIUM; 
    return BLK_QC_T_NONE;

When I get the callback from the kernel, I iterate through the segments in the bio. For each segment, I make a request to the userspace application (in another thread) to service the read or write (I'll explain how that works in a minute), and then the original requesting thread goes to sleep. When userspace returns with the data (for a read) or success/failure (for a write), it hands over the data and wakes up the original requesting thread. Once all the segments have been serviced, the original requesting thread completes the bio and returns it to the kernel:

    bio_endio(bio); // complete the bio, the kernel does the followup callback to the next guy in the chain who wants this bio
    return BLK_QC_T_NONE;
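
To make that concrete, here's a rough sketch of the flow I'm describing. The helper request_from_userspace() and the error handling are invented for illustration, and the segment mapping is shown with kmap() purely for the sake of the sketch; this is the shape of the thing, not the real driver code.

    #include <linux/blkdev.h>
    #include <linux/bio.h>
    #include <linux/highmem.h>

    /* Sketch only: hypothetical helper names, not the real driver code. */
    static blk_qc_t sbd_make_request(struct request_queue *q, struct bio *bio)
    {
        struct bio_vec bvec;
        struct bvec_iter iter;
        unsigned int operation = bio_op(bio);

        if (operation != REQ_OP_READ && operation != REQ_OP_WRITE) {
            log_kern_debug("bio operation is not read or write: %d", operation);
            bio->bi_status = BLK_STS_MEDIUM;
            bio_endio(bio);
            return BLK_QC_T_NONE;
        }

        bio_for_each_segment(bvec, bio, iter) {
            /* map the segment's page to a kernel virtual address */
            unsigned char *kaddr = kmap(bvec.bv_page);
            void *buffer = kaddr + bvec.bv_offset;

            /* hand the segment to the userspace thread and sleep until it
             * comes back in via ioctl with the data / status (hypothetical) */
            int err = request_from_userspace(operation, iter.bi_sector,
                                             buffer, bvec.bv_len);

            kunmap(bvec.bv_page);
            if (err) {
                bio->bi_status = BLK_STS_IOERR;
                bio_endio(bio);
                return BLK_QC_T_NONE;
            }
        }

        bio_endio(bio); /* all segments serviced */
        return BLK_QC_T_NONE;
    }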

The way the call to userspace works is this: first, the userspace program makes an ioctl call to the kernel module, and the kernel module blocks. That thread stays blocked until a request comes in for the block device. The information about the request (read/write, start position, length, etc.) gets copied to a userspace-provided buffer with copy_to_user(), and then the ioctl call is unblocked and returns. Userspace gets the request from the return of the ioctl, does the read or write, then makes another ioctl call to the kernel module with the results of the request. That call wakes up the original requesting thread so it can return the result from the make_request callback, and then the userspace ioctl blocks again waiting for the next request to come in.
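
For illustration, here is roughly what the userspace side of that loop looks like. The ioctl numbers and the request/response structs below are invented for this sketch; only the blocking-ioctl round trip reflects the actual design.

    /* Sketch of the userspace service loop; names and ioctl numbers are
     * invented, only the overall flow matches the description above. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    struct blkm_request {            /* filled in by the kernel via copy_to_user() */
        uint32_t trans_id;
        uint32_t op;                 /* 0 = read, 1 = write */
        uint64_t start_pos;          /* byte offset into the block device */
        uint64_t length;             /* 4k per segment in the logs below */
    };

    struct blkm_response {           /* read back by the kernel via copy_from_user() */
        uint32_t trans_id;
        int32_t  status;
        uint8_t  data[4096];         /* the data to satisfy a read request */
    };

    #define BLKM_GET_REQUEST  _IOR('k', 1, struct blkm_request)   /* hypothetical */
    #define BLKM_PUT_RESPONSE _IOW('k', 2, struct blkm_response)  /* hypothetical */

    static int service_loop(int devfd, int backing_fd)
    {
        struct blkm_request req;
        struct blkm_response resp;

        for (;;) {
            /* blocks inside the kernel module until a block request arrives */
            if (ioctl(devfd, BLKM_GET_REQUEST, &req) < 0)
                return -1;

            memset(&resp, 0, sizeof(resp));
            resp.trans_id = req.trans_id;

            if (req.op == 0) {       /* read: supply the data for the segment */
                ssize_t n = pread(backing_fd, resp.data, req.length,
                                  (off_t)req.start_pos);
                resp.status = (n == (ssize_t)req.length) ? 0 : -1;
            } else {
                /* writes omitted for brevity: their data would travel
                 * kernel -> user as part of the request instead */
                resp.status = 0;
            }

            /* hands the result back; the kernel module wakes the original
             * requesting thread, then this loop blocks again for the next one */
            if (ioctl(devfd, BLKM_PUT_RESPONSE, &resp) < 0)
                return -1;
        }
    }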

So here's the problem. On the Raspberry Pi 4 only, every once in a while (not all the time), the contents of the memory passed between the two threads don't look the same from both threads' points of view. That is, when the data gets passed from the userspace-side thread to the original requesting thread (for a read request in this example), the hash of the data (at THE SAME LOCATION IN MEMORY!) is different. I assumed this is a CPU cache coherency type of problem, except that I put in calls to mb(), smp_mb(), READ_ONCE() and WRITE_ONCE(), and I even tried plain old sleeping to give the original calling thread's CPU time to notice. It fails reliably, but not every time. I don't have any other Raspberry Pi 4s to test with, but I'm pretty sure the machine is fine because everything else works great. It's something I'm not doing right, but I don't know what.

What follows is a grep of kern.log and an explanation of what's going on. Every request going to userspace gets a transaction id. The start pos is the location in the block device to read from or write to. The length is the length of the bio segment to read/write (always 4k here). The crc32 column is the crc32 of the data in the bio segment buffer, for the length listed. The address column is the address of the bio segment buffer that the data read from userspace is copied into (and that the crc32 is computed over); it is always the same for a given transaction. The last column is the thread id.

oper    trans id start pos        length           crc32            address  thread
write:  00000a2d 000000000001d000 0000000000001000 0000000010e5cad0          27240

read0:  00000b40 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31415
read1:  00000b40 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read2:  00000b40 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31415
readx:  00000b40 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read3:  00000b40 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31415

read0:  00000c49 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31417
read1:  00000c49 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read2:  00000c49 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31417
readx:  00000c49 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read3:  00000c49 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31417

read0:  00000d4f 000000000001d000 0000000000001000 000000009b5eeca2 88314387 31419
read1:  00000d4f 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read2:  00000d4f 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31419
readx:  00000d4f 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31392
read3:  00000d4f 000000000001d000 0000000000001000 0000000010e5cad0 88314387 31419

read0:  00000e53 000000000001d000 0000000000001000 000000009b5eeca2 1c6fcd65 31422
read1:  00000e53 000000000001d000 0000000000001000 0000000010e5cad0 1c6fcd65 31392
read2:  00000e53 000000000001d000 0000000000001000 0000000010e5cad0 1c6fcd65 31422
readx:  00000e53 000000000001d000 0000000000001000 0000000010e5cad0 1c6fcd65 31392
read3:  00000e53 000000000001d000 0000000000001000 0000000010e5cad0 1c6fcd65 31422

So the steps in the process are as follows. Let's look at the first read transaction, id b40, because that one worked correctly; then we'll look at the second one, c49, which didn't. Transaction ids always increase, and the logs above are in chronological order.

1) First the write comes in (trans id a2d). The crc32 of the data written is 10e5cad0. That's the crc32 we expect to see on all subsequent reads until the next write.

2) A read request comes in to the blk_queue_make_request callback handler on thread 31415. At this point I log ("read0") the crc32 of the contents of the bio segment buffer before it is written to, so I have the before-it-changes value of the bio segment buffer at 88314387.

3) I copy_to_user() the information about the read request and return from the ioctl. Userspace processes it, then does an ioctl back into the kernel module with the resulting data, and that data is copy_from_user()ed into the bio segment buffer (at 88314387). The kernel module logs ("read1") the crc32 of the bio segment buffer from userspace thread 31392's point of view. It is the expected 10e5cad0.

4) Userspace wakes up the original requesting thread 31415 now that the data is in the bio segment buffer at 88314387. Thread 31415 calculates the crc32 again and logs ("read2") the value it sees from its point of view. Again, as expected, it is 10e5cad0.

5) For extra sanity checking (the reason for which will become clear in the next transaction), the userspace thread 31392 does the crc32 of the bio buffer at 88314387 again, comes up with the expected value 10e5cad0, and logs it ("readx"). There is no reason it should have changed; nobody is updating it, and it still says 10e5cad0.

6) As a final extra sanity check, the original requesting thread 31415 sleeps for 2ms, calculates the crc32 again, and logs it ("read3"). Everything worked; all is well.

Now let's look at the next transaction, id c49. This is a case where the filesystem requested a read of the same block again; I forced this in my test with echo 3 > /proc/sys/vm/drop_caches. I'll start counting steps at 2 so the steps line up with the first example.

2) A read request comes in to the blk_queue_make_request callback handler on thread 31417. At this point I log ("read0") the crc32 of the contents of the bio segment buffer before we write to it. This is the same bio segment buffer as in the first transaction b40 (memory location 88314387), but apparently it has been written over since we last set it, and that's fine. It also appears to have been set back to the same value it had at the start of transaction b40; the crc32 is 9b5eeca2. That's fine too: we now know the initial crc32 of that bio segment buffer, from thread 31417's point of view, before anybody writes to it.

3) I copy_to_user() the information about the read request and return from the ioctl. Userspace processes it, then does an ioctl back into the kernel module with the resulting data, and that data is copy_from_user()ed into the bio segment buffer (at 88314387). The kernel module logs ("read1") the crc32 of the bio segment buffer from userspace thread 31392's point of view. It is the expected 10e5cad0. The userspace thread id will always be 31392 because the userspace program making the ioctl calls is single threaded.

4) Userspace wakes up the original requesting thread 31417 now that the data SHOULD BE in the bio segment buffer at 88314387. Thread 31417 calculates the crc32 again and logs ("read2") the value it sees from its (thread 31417's) point of view. But this time the value is not the expected 10e5cad0. Instead it is the same value (9b5eeca2) as before the request was sent to userspace to update the buffer. It is as if userspace never wrote to the buffer. But it did, because we read the buffer, calculated the crc32 and logged it from the userspace-side thread 31392. Same memory location, different thread, different perception of the contents of the bio segment buffer at 88314387. Different thread, presumably a different CPU, and therefore a different CPU cache. Even if I were getting the thread blocking and waking wrong, the logs show the order of events: one thread read the correct value after the other thread misread it.

5) Again the extra sanity check: the userspace thread 31392 does the crc32 of the same bio buffer at 88314387 again and gets the same correct value, 10e5cad0, logged as "readx". The logs are chronological, so thread 31392 sees the correct value after thread 31417 saw the wrong value.

6) As a final extra sanity check, the original requesting thread 31417 sleeps for 2ms, calculates the crc32 again, and logs it ("read3"). It still sees the incorrect value 9b5eeca2.

Of the four read transactions I logged above, the first, third and fourth worked; the second did not. So I figured, okay, it must be a cache coherency problem. But I added mb() and smp_mb() calls after read1 and before read2, and nothing changed.

I am stumped. I've read the Linux kernel memory barrier document

https://www.kernel.org/doc/Documentation/memory-barriers.txt

a number of times, and I figured smp_mb() should just fix everything, but it does not.
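
To be concrete about the handoff, here is a minimal sketch of the pattern I'm describing, with invented names (my real code's wait/wake mechanism may differ; I'm using a completion for the sketch). As I read memory-barriers.txt, a handoff through wait_for_completion()/complete() (or a proper wait_event()/wake_up() pair) already guarantees that the waiter sees everything the waker wrote before the wake, so the explicit mb()/smp_mb() calls should be redundant at worst, which only adds to my confusion.

    #include <linux/types.h>
    #include <linux/completion.h>

    /* Sketch with invented names; not the actual driver code. */
    struct segment_txn {
        struct completion done;     /* the requesting thread sleeps on this */
        void  *buffer;              /* kernel mapping of the bio segment */
        size_t len;
        int    status;
    };

    /* ioctl thread (31392), after copy_from_user() of the read data into
     * txn->buffer: */
    static void txn_complete(struct segment_txn *txn, int status)
    {
        txn->status = status;
        complete(&txn->done);       /* the waiter is guaranteed to observe the
                                       writes made before this call */
    }

    /* original requesting thread (e.g. 31417), in the make_request path: */
    static int txn_wait(struct segment_txn *txn)
    {
        wait_for_completion(&txn->done);
        return txn->status;         /* txn->buffer contents should be visible here */
    }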

I have no idea how to fix this. I can't even think of a lousy workaround. I set the contents of a memory location, and the other thread just doesn't see it. What do I do?

Help? Thanks.

stu
  • This site is about programming. Is there any Git repository with (simplified) module source available? – 0andriy Dec 21 '19 at 20:54
  • With all respect to the Raspberry Foundation and their wonderful products, this story reminded me of a bug-hunting experience from test cases we used in ISO-9000 / TQM pre-acquisition QA-gate burn-in tests, where some motherboards passed and some failed to correctly store / retain / yield data from certain bitmap patterns during intensive read/write traffic - yes, PCB-wire-level spurious transmissions / crosstalk were to be suspected - like when power density was high or low - as in many zeroes and a single 1 (lost) somewhere in between them, or vice versa (where too much power "filled" a middle 0). – user3666197 Dec 21 '19 at 21:06
  • I have multi-threading experience with Java only. Your problem reminds me of memory access integrity. There are best practices from your threading library, `pthread.h`, for sharing memory sections across threads or in thread context. Threads have their own memory access mechanisms, specific to the library implementation. – Dudi Boy Dec 21 '19 at 23:42
  • @0andriy it's really complicated and it would take me a while to work up a trimmed down version that just proves the point, but if it comes to that, I'll do it. Looking for other simpler things to try first. :-) – stu Dec 22 '19 at 01:48
  • @user3666197 I'd be the first in line to blame the hardware, as all of my decades of software experience have never shown me anything like this. But I find it hard to imagine that nothing else in the Linux kernel does anything that would trigger this problem, since it happens so often for me. That said, I ordered another Pi 4, so we'll see. – stu Dec 22 '19 at 01:50
  • @DudiBoy yeah, there's no pthreads here, this is the kernel, you get what they give you. – stu Dec 22 '19 at 01:51
  • @stu I share your doubts about WHY only "your" kernel-module block device suffers the flaw. Could be because of a weird location in the address space that (almost) never gets visited by applications; could be it gets "otherwise" occupied by code / a table in which this particular memory zone contains static / unimportant / error-tolerant data, or a service where "additional" source(s) of "randomness" do not matter (unless some critical failure proves otherwise) - so many reasons why "other" pieces of code might not even detect this. We had a similar dilemma & used MEMTEST86 pre-boot tests to pass/fail. – user3666197 Dec 22 '19 at 03:38
  • @stu One more story about this sort of hard-to-explain error. Telco operators in Europe used many field-deployed E1 carrier service 2.048 Mbps PDH network devices for driving the E1-circuit devices over the L1-PHY (yes, cables). These devices implemented the CCITT-standard HDB3 line code, which by itself made it possible to detect flaws of this kind and better isolate the problem inside the otherwise passive cable network. There was a man who was similarly frustrated by many erroneous devices & who finally traced the flawed HDB3-logic behaviour down to one bad soldering point on the PCB. – user3666197 Dec 22 '19 at 03:49
  • @user3666197 Thanks for the wisdom. I love good stories from experience, especially when they're related to my field but a bit outside my arena. Happy to listen to any more wise tales you have. Good tip on trying other locations; not sure how much control I have over that, but I can look into it, and you're right, high memory where kernel stuff lives is going to be less variable than userspace. I've thought of a few more variations of the code I can try, but the new Pi 4 will tell all, one way or the other. Thanks again for the insight and experience. – stu Dec 22 '19 at 17:47
  • @stu Always welcome, Sir. Fair to notice your DeadPelican stories are indeed enlightening & worth a read, plus loads of further thinking about Do-s & Don't-s (a deep believer in Human Intelligence here, but TL;DR). Nota bene: almost sure you have already thought about this, yet a second RPi4 need not "decide" the issue; in case the error is caused by a location-based PCB / process trouble, the manufacturing will simply "repeat" it (for which even the memory chips, the next suspect, need not be from the same foundry batch or of the same origin at all). Pleasure to meet you, Sir + Merry Xmas! – user3666197 Dec 22 '19 at 21:29
  • @user3666197 yeah, any number of odd things can go wrong. I'm going to make a $60 bet that I won't be unlucky twice. I'll let you know what happens. Thanks for the kind words. Happy Xmas and all other holidays. – stu Dec 23 '19 at 23:52
  • For anybody following along at home... the second Pi 4 arrived and I tried it out, and... the exact same thing happened. So either I'm REALLY unlucky and got 2 broken Pi 4s, or there's a bug in my program I'm not seeing, or it's a consistent design flaw in the Pi 4. So I'm going to write a simple proof of concept and send it off to the Pi people and see if they have anything to say, or can show me where I'm going wrong. – stu Dec 26 '19 at 17:50
  • Oooops, @stu, you've lost the bet. OTOH, the situation is better in the sense that the behaviour is not random (the worst bug-catcher's nightmare is hunting a bug in randomly moving sands - only sometimes, at random, having an error to diagnose, and most of the time not). Given all the above, my guess is option C (be it a straight design flaw, or a systematically flawed board / system-assembly manufacturing process that injects the non-compliant erroneous behaviour). My fingers are crossed for a wider RPi-community field test/eval + P.F. 2020! – user3666197 Dec 27 '19 at 17:54
  • I'm inclined to agree. I thought I figured out a workaround that involved a few intermediate memcpys, and it worked more reliably, but still eventually failed, so the proof of concept should be easier to write. Again thanks for all the help and perspective. If I find out or work anything else out, I'll let you know. – stu Dec 28 '19 at 14:27

1 Answer


So, miracle of miracles, I came across the answer totally by accident. I'd like to share it in case anybody else comes across this and similarly bangs their head for months.

In a totally unrelated change for another system I'm working on that uses this block driver, I made a change today, tried it on the Pi 4, and like magic it's all working.

What was the change? Not at all where I was looking....

So I register a callback with blk_queue_make_request(), not blk_init_queue(). I don't handle the request queue; I handle the bios in the block requests directly.

In so doing, you are told: https://www.kernel.org/doc/htmldocs/kernel-api/API-blk-queue-make-request.html

"The driver that does this must be able to deal appropriately with buffers in “highmemory”. This can be accomplished by either calling __bio_kmap_atomic to get a temporary kernel mapping, or by calling blk_queue_bounce to create a buffer in normal memory. "

Well, I had been achieving that with calls to kmap_atomic() when I wanted to get at the buffer. Today I read that there are a limited number of slots for those mappings, and that you should only use kmap_atomic() when you're in an atomic (e.g. interrupt) context and can't sleep, because it pulls from a reserved pool so that the call never has to allocate and therefore never blocks.

But my kernel module can sleep, so I changed the call to kmap() and... like magic... it's working.
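
In code terms the change was essentially this (sketched around a bio_vec for context; not the literal diff from my driver):

    #include <linux/bio.h>
    #include <linux/highmem.h>

    /* Before (broken for my use): kmap_atomic() returns a temporary per-CPU
     * mapping that is supposed to be released with kunmap_atomic() before
     * sleeping -- not something to hold across a sleep and then touch from a
     * thread that may be running on another CPU. */
    static void *map_segment_atomic(struct bio_vec *bvec)
    {
        unsigned char *kaddr = kmap_atomic(bvec->bv_page);
        return kaddr + bvec->bv_offset;   /* released later with kunmap_atomic(kaddr) */
    }

    /* After (works): kmap() may sleep, but the mapping it returns lives in the
     * shared kernel address space, so the same virtual address is valid from
     * the userspace-ioctl thread on any CPU until kunmap(bvec->bv_page). */
    static void *map_segment_sleepable(struct bio_vec *bvec)
    {
        unsigned char *kaddr = kmap(bvec->bv_page);
        return kaddr + bvec->bv_offset;
    }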

So I'm thinking the failure cases were where kmap_atomic() was failing and I wasn't catching or noticing it, or possibly something is wrong with kmap_atomic() on the Pi 4, or with the interaction between my module and the kernel in that case. I'll play around more and see if I can tease out what's going on, but the bottom line is that something was wrong with the way I was calling kmap_atomic().

After a little playing around...

Feb 25 21:12:46 pi205 kernel: [86149.193899][13905] kernel:    buffer after kmap_atomic ffefd000
Feb 25 21:12:46 pi205 kernel: [86149.193912][13905] kernel:    buffer after kmap        bfe84000

So when kmap_atomic() returns a different address than kmap() does, that's when the other thread doesn't see the memory correctly. I read that these kmap_atomic() mappings are kept per-CPU, which would explain that behaviour, if that's what it is.
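
A quick way to see the difference (a sketch of the kind of debug check that produced the two log lines above, not the exact code) is to map the same segment page both ways and compare the addresses:

    #include <linux/highmem.h>

    /* For a lowmem page both calls return the same address; for a highmem page
     * kmap_atomic() hands back a per-CPU slot while kmap() returns a mapping
     * that any thread on any CPU can use until kunmap(). */
    static void compare_mappings(struct page *page)
    {
        void *atomic_addr = kmap_atomic(page);
        kunmap_atomic(atomic_addr);

        void *shared_addr = kmap(page);
        log_kern_debug("buffer after kmap_atomic %px", atomic_addr);  /* %px = raw address */
        log_kern_debug("buffer after kmap        %px", shared_addr);
        kunmap(page);
    }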

stu