I'm trying to allocate and use 100 MB of data (size_t s = 100 * 1024 * 1024) and I measured several ways of doing it in C++:
Raw allocation without init:
// < 0.01 ms
auto p = new char[s];
C++ zero-init:
// 20 ms
auto p = new char[s]();
Manual zero-init:
// 20 ms
auto p = new char[s];
for (size_t i = 0; i < s; ++i)
p[i] = 0;
This is not a bandwidth limitation of my memory, as demonstrated by writing to the same memory again:
// 3 ms
std::memset(p, 0xFF, s);
I also tried std::malloc and std::calloc, but they show the same behavior. calloc returns memory that is zero-initialized, but if I do a memset afterwards, that still takes 20 ms.
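To be explicit, the calloc experiment looks roughly like this (a sketch; headers <cstdlib> and <cstring> assumed):
// calloc returns quickly and the memory reads as zero
auto p = static_cast<char*>(std::calloc(s, 1));
// ~20 ms, same as the first touch in the new[] variants
std::memset(p, 0xFF, s);
std::free(p);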
As far as I understand it, uninitialized memory is fast to allocate because the allocation doesn't actually touch the memory; only when I access it are the pages mapped into my process. The 3 ms for setting 100 MB correspond to ~35 GB/s, which is OK-ish. The 20 ms seem to be overhead from triggering the page faults.
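For reference, that figure is just the buffer size divided by the measured time:
100 * 1024 * 1024 B / 0.003 s ≈ 3.5 * 10^10 B/s ≈ 35 GB/s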
The funny thing is that this seems to be compute overhead: if I initialize the memory with multiple threads, it gets faster:
// 6-10 ms
auto p = new char[s];
#pragma omp parallel for
for (size_t i = 0; i < s; ++i)
p[i] = 0;
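For completeness, here is the parallel variant as a self-contained sketch (assuming it is compiled with -fopenmp; the chrono wrapper is roughly how I timed all the variants above):
#include <chrono>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t s = 100 * 1024 * 1024;

    auto t0 = std::chrono::high_resolution_clock::now();
    auto p = new char[s];               // allocation itself is nearly instant
    #pragma omp parallel for
    for (std::size_t i = 0; i < s; ++i)
        p[i] = 0;                       // first touch of each page happens here
    auto t1 = std::chrono::high_resolution_clock::now();

    std::printf("%.2f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    delete[] p;
}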
My question: is there a way to not only allocate the memory but also immediately fault in all of its pages, so that no further page faults occur when accessing it?
I would like to avoid using huge pages if possible.
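To make it concrete, the behavior I'm after is something along the lines of mmap with MAP_POPULATE on Linux, i.e. pre-faulting everything at allocation time (just an illustration of what I mean, not necessarily the right or fastest way to get it):
#include <sys/mman.h>
#include <cstddef>

int main() {
    const std::size_t s = 100 * 1024 * 1024;

    // MAP_POPULATE asks the kernel to fault in the pages up front,
    // so later writes should not have to take page faults.
    void* p = mmap(nullptr, s, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    // ... use static_cast<char*>(p) as the buffer ...

    munmap(p, s);
}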
(Measurement with std::chrono::high_resolution_clock)
This was done on my desktop system (5 GHz i9, 3600 MHz DDR4, Linux Mint 19, kernel 4.15.0-45-generic) with Clang 7 (-O2 -march=native), though, looking at the assembly, the compiler is not the problem.
EDIT: This is a simplified example; in my actual application I need to initialize the memory with a value other than 0, but that doesn't change the timing at all.
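Think of something like the following in place of the zeroing loop, where initial_value is just a placeholder for whatever my application actually needs (requires <algorithm>):
// still ~20 ms on first touch, regardless of the value
auto p = new char[s];
std::fill_n(p, s, initial_value);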