Twice as many page faults when reading from a large malloced array instead of just storing?

Question

I am running a simple test to monitor page faults with the code below. What I don't understand is how changing a single line of code doubled my page-fault count. If I use

 ptr[i+4096] = 'A'

I get 25,722 page faults with the perf tool, which is what I expected, but if I use

tmp = ptr[i+4096]

instead, the page-fault count doubles to 51,322. I don't know how to explain it. The complete code is below. Thanks!

#include <stdlib.h>

void do_something() {
    int i;
    char* ptr;
    char tmp;
    ptr = malloc(100*1024*1024);
    int j = 0;
    int k = 0;

    for (i = 0; i < 100*1024*1024; i+=4096) {

       //ptr[i+4096] = 'A' ;
       tmp = ptr[i+4096];

       for (j = 0 ; j < 4096; j++)
           ptr[i+j] = (char) (i & 0xff); // pagefault
    }
    free(ptr);
}

int main(int argc, char* argv[]) {
    do_something();
    return 0;
}

Machine Info:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
Stepping: 2
CPU MHz: 3096.188
BogoMIPS: 6197.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39

3.10.0-514.32.3.el7.x86_64 #1

And you compiled without optimization, I guess? Because you didn't use `volatile` on anything, it would all optimize away if you compiled with normal optimization enabled (-O2 or -O3). So either way you're storing to the stack, because your loop counters won't be optimized into registers. Anyway, I could see an extra dTLB miss from storing to the stack after a page fault modifies the HW page tables to wire the new page (if Meltdown mitigation invalidated the TLBs), but not an extra page fault. IDK why that would happen. — Peter Cordes, Sep 05 '18 at 16:37
In case it matters, what Linux version, and what architecture? Is it x86 with Meltdown / Spectre mitigation? (And BTW, with optimization the `tmp = ptr[i+4096]` loads would optimize away, but probably not the store version because gcc doesn't detect malloc/free without passing the pointer anywhere as being a dead store.) — Peter Cordes, Sep 05 '18 at 16:48
I built it without any optimization. @Peter, Are you able to repeat it on your linux? — Yunzhou Wu, Sep 05 '18 at 17:29
NOTE the `ptr[i+4096]` will cause an out-of-range memory access on the last iteration of the loop. — Leo K, Sep 05 '18 at 17:49

BeeOnRope · Accepted Answer · 2018-09-06T17:19:51.913

malloc() will often satisfy requests for memory by asking the OS for new pages, e.g., via mmap. Such pages are generally allocated lazily: no actual page is allocated until the first access.

What happens then depends on the type of the first access: when you do a read first, Linux maps in a shared read-only COW page of zeros to satisfy it, and if you later write to that page, it takes a second fault to allocate the private writable page.

When you just do the write first, the first step is skipped. That's the usual case since code generally isn't reading from newly allocated memory which has undefined contents (at least when you get it from malloc).

Note that the above is a description of how newly allocated pages work in Linux. When you use malloc there is another layer: malloc will generally try to satisfy requests from blocks the process freed earlier, rather than continually requesting new memory. When memory is re-used, it will generally already be paged in and the above won't apply. Of course, for your initial big allocation of 100 MiB, there is no memory to re-use, so you can be sure the allocator is getting it from the OS.

Oh right, and the buffer isn't page-aligned, so the first access to each page is a read at the end of the char-at-a-time loop, not one-per-page write. (glibc `malloc` usually keeps the first 16 bytes of a page for bookkeeping, so `ptr` is probably something like `0x...0010`) — Peter Cordes, Sep 06 '18 at 16:54
@PeterCordes - I don't think alignment matters for this particular code: the read happens at `i + 4096` every loop, and then the rest of the loop writes up to `i + 4095`, so the reads are always running ahead of the writes and so every page will get a read access first, regardless of the alignment of the block returned by `malloc`. — BeeOnRope, Sep 06 '18 at 17:13
oops, I was remembering things totally backwards from looking at it yesterday. — Peter Cordes, Sep 06 '18 at 17:15
