
I have a big shared memory region (4 MB) allocated between two processes. My process-1 writes to this memory pool, using it as a circular buffer, in chunks of 256 bytes, one after the other. My process-2 reads from the memory. I am using locks for synchronization. I was measuring the write time, and I could see spikes at every 16th operation. My guess is that this is because of accessing a new page (since 16 * 256 bytes = 4096 bytes, one page).

Since this is happening at a critical point in my program and results in high latency, I decided to warm up the page table of process-1 by accessing this mempool just after the constructor allocates/binds to the shared memory:

// global variable, so the compiler cannot prove the value is unused
unsigned char dummy_byte = 0;

for (int i = 0; i < 16 * 1024; i++)
{
    dummy_byte ^= buffer[i * 256];
}

The objective is to access one byte from each of the chunks, so that all the pages are fetched. I'm using a global variable for this, because otherwise the compiler optimises the loop away (since no one reads the saved value). I verified later using objdump that this code does not get removed.

The problem I'm facing is that the latency spikes still happen. While playing with the warm-up logic, I tried this:

for (int i = 0; i < 16 * 1024; i++)
{
    buffer[i * 256] = 0;
}

I found out that this results in no latency spikes at the critical point. The problem is that I do not want to write junk into the buffer, since there might be something useful present in there, and a read-then-write of the same byte may result in a race condition, because another process might be reading from the shared memory at the same time.

I want to know:

  1. Is this indeed happening because of the TLB, or something else?
  2. If it is the TLB, why is a read unable to warm up the pages, while a write is?
  3. Is there anything more that can be experimented with?
Jeff Schaller
W1nTer003
  • Regarding the variable: You could make it `volatile` and reside in the inner-most scope. – bitmask Feb 20 '23 at 10:40
  • Instead of using memory accesses to inform the OS in a roundabout way, why not straight up use `mlock` to tell it to keep that memory resident at all times? (Or `mmap` with the `MAP_LOCKED` flag.) – Botje Feb 20 '23 at 10:43
  • The compiler is allowed to optimise out reads into `dummy_byte` whose value is never used. If you further use it (or at least pretend well enough), then it does not. – Öö Tiib Feb 20 '23 at 12:01
  • @ÖöTiib An easy option to prevent the reads from being optimized out would be to read through a `volatile`-qualified value - reading `volatile` values [counts as a side-effect](https://eel.is/c++draft/intro.execution#7), so compilers are never allowed to optimize those out (even if the read value is unused) - for example [like in this godbolt](https://godbolt.org/z/P9dfdG7b8) (note the 6 identical loads in the assembly output that haven't been optimized out) – Turtlefight Feb 21 '23 at 06:09
  • @Botje I tried using mlock(), and locked the entire shared memory (successfully locked, return value 0), but the same problem still persists. The latency spikes are still present. It may be the case that the hypothesis for the spikes is wrong in the first place, and it's happening due to some entirely different reason. – W1nTer003 Feb 24 '23 at 06:39

0 Answers