The Problem
In the course of attempting to reduce/eliminate the occurrence of minor pagefaults in an application, I discovered a confusing phenomenon; namely, I am repeatedly triggering minor pagefaults for writes to the same address, even though I thought I had taken sufficient steps to prevent pagefaults.
Background
As per the advice here, I called mlockall to lock all current and future pages into memory.
In my original use-case (which involved a rather large array) I also pre-faulted the data by writing to every element (or at least to every page) as per the advice here; though I realize the advice there is intended for users running a kernel with the RT patch, the general idea of forcing writes to thwart COW / demand paging should remain applicable.
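For reference, the lock-then-prefault setup described above can be sketched roughly as follows. This is a minimal sketch rather than my original code: the helper names lock_all_memory and prefault are hypothetical, and it assumes the buffer's contents may be overwritten with zeros before use.

```cpp
#include <sys/mman.h>  // mlockall, MCL_CURRENT, MCL_FUTURE
#include <unistd.h>    // sysconf, _SC_PAGESIZE
#include <cstddef>     // std::size_t
#include <vector>

// Hypothetical helper: lock all current and future mappings into RAM.
// Returns false on failure (for example when RLIMIT_MEMLOCK is too low).
bool lock_all_memory() {
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
}

// Hypothetical helper: write one byte per page-sized stride so that every
// page of the buffer is backed by a writable frame before the hot loop runs.
// Overwrites the touched bytes with 0. Returns the number of stride writes.
std::size_t prefault(volatile unsigned char *buf, std::size_t len) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    std::size_t touched = 0;
    for (std::size_t off = 0; off < len; off += page) {
        buf[off] = 0;
        ++touched;
    }
    // The buffer need not be page-aligned, so its tail can spill onto one
    // more page than the strides above reached; touch the last byte too.
    if (len > 0) buf[len - 1] = 0;
    return touched;
}
```

A caller would invoke lock_all_memory() once at startup and then prefault(data, size) on each large buffer before entering the time-critical section.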
I had thought that mlockall could be used to prevent minor page faults. While the man page only seems to guarantee that there will be no major faults, various other resources (e.g. above) state that it can be used to prevent minor page faults as well.
The kernel documentation seems to indicate this as well. For example, unevictable-lru.txt and pagemap.txt state that mlock()'ed pages are unevictable and therefore not suitable for reclamation.
In spite of this, I continued to trigger several minor pagefaults.
Example
I've created an extremely stripped down example to illustrate the problem:
#include <sys/mman.h> // mlockall
#include <stdlib.h>   // abort

int main(int, char **) {
    int x;
    if (mlockall(MCL_CURRENT | MCL_FUTURE)) abort();
    while (true) {
        asm volatile("" ::: "memory"); // So GCC won't optimize out the write
        x = 0x42;
    }
    return 0;
}
Here I repeatedly write to the same address. It is easy to see (e.g. via cat /proc/[pid]/stat | awk '{print $10}', where field 10 is the minor fault count) that I continue to have minor pagefaults long after the initialization is complete.
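The same counter can also be sampled from inside the process, which avoids polling /proc from a second terminal. The sketch below uses getrusage; the helper name minor_faults is hypothetical, and it is an alternative to the command I actually used, not the method that produced the numbers in this post.

```cpp
#include <sys/resource.h>  // getrusage, struct rusage, RUSAGE_SELF

// Hypothetical helper: number of minor (no I/O) page faults taken so far
// by the calling process, or -1 if getrusage fails.
long minor_faults() {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_minflt;
}
```

Sampling minor_faults() before and after a stretch of the loop and subtracting gives the number of faults taken in between. Note that RUSAGE_SELF covers all threads of the process.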
Running a modified version* of the pfaults.stp script included in systemtap-doc, I logged the time of each pagefault, the address that triggered the fault, the address of the instruction that triggered the fault, whether it was major/minor, and whether it was a read or a write. After the initial faults from startup and mlockall, all faults were identical: the attempt to write to x triggered a minor write fault.
The interval between successive pagefaults displays a striking pattern. For one particular run, the intervals were, in seconds:
2, 4, 4, 4.8, 8.16, 13.87, 23.588, 40.104, 60, 60, 60, 60, 60, 60, 60, 60, 60, ...
This appears to be (approximately) exponential back-off, with an absolute ceiling of 1 minute.
Running it on an isolated CPU has no impact; neither does running with a higher priority. However, running with a realtime priority eliminates the pagefaults.
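One way to request a realtime priority from inside the program is sketched below. This is an assumption for illustration: the post does not depend on a particular policy, and SCHED_FIFO at the lowest realtime priority is just one choice. The helper names are hypothetical, and the call normally needs root or CAP_SYS_NICE.

```cpp
#include <sched.h>  // sched_setscheduler, sched_get_priority_min, SCHED_FIFO

// Hypothetical helper: lowest valid SCHED_FIFO priority on this system.
int lowest_fifo_priority() {
    return sched_get_priority_min(SCHED_FIFO);
}

// Hypothetical helper: switch the calling process to SCHED_FIFO at the
// lowest realtime priority. Returns false on failure (typically EPERM
// when the process lacks the required privilege).
bool make_realtime() {
    struct sched_param sp = {};
    sp.sched_priority = lowest_fifo_priority();
    return sched_setscheduler(0, SCHED_FIFO, &sp) == 0;
}
```

Calling make_realtime() before the loop has the same effect as launching the unmodified binary under a realtime policy from the shell.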
The Questions
1. Is this behavior expected?
1a. What explains the timing?
2. Is it possible to prevent this?
Versions
I'm running Ubuntu 14.04, with kernel 3.13.0-24-generic and Systemtap version 2.3/0.156, Debian version 2.3-1ubuntu1 (trusty). Code compiled with gcc-4.8 with no extra flags, though optimization level doesn't seem to matter (provided the asm volatile directive is left in place; otherwise the write gets optimized out entirely).
I'm happy to include further details (e.g. exact stap script, original output, etc.) if they will prove relevant.
*Actually, the vm.pagefault probe was broken for my combination of kernel and systemtap because it referenced a variable that no longer existed in the kernel's handle_mm_fault function, but the fix was trivial.