
I'm trying to allocate and use 100 MB of data (size_t s = 100 * 1024 * 1024) and I measured several ways of doing it in C++:

Raw allocation without init:

// < 0.01 ms
auto p = new char[s]; 

C++ zero-init:

// 20 ms
auto p = new char[s](); 

Manual zero-init:

// 20 ms
auto p = new char[s];
for (size_t i = 0; i < s; ++i)
    p[i] = 0;

This is not a limitation of my memory bandwidth, as demonstrated by writing to the same memory again:

// 3 ms
std::memset(p, 0xFF, s);

I also tried std::malloc and std::calloc, but they show the same behavior. calloc returns memory that is zero-initialized, but a memset on it afterwards still takes 20 ms.

As far as I understand it, uninitialized memory is fast to allocate because the allocation doesn't actually touch the memory; only when I first access it are the pages mapped into my process. The 3 ms for setting 100 MB correspond to ~35 GB/s, which is OK-ish. The 20 ms seem to be mostly the overhead of triggering page faults.

The funny thing is that this seems to be compute overhead: if I initialize the memory with multiple threads, it gets faster:

// 6-10 ms
auto p = new char[s];
#pragma omp parallel for
for (size_t i = 0; i < s; ++i)
    p[i] = 0;

My question: is there a way not only to allocate memory but also to immediately map all its pages, so that no further page faults arise when accessing it?

I would like to avoid using huge pages if possible.

(Measurements were taken with std::chrono::high_resolution_clock.)
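
For completeness, a minimal sketch of what such a measurement could look like; the exact harness is an assumption (only the clock is stated above). It prints a byte of the buffer so the compiler cannot optimize the writes away, as suggested in the comments below:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    constexpr std::size_t s = 100 * 1024 * 1024;

    auto t0 = std::chrono::high_resolution_clock::now();
    auto p = new char[s];      // allocation alone: pages not yet mapped
    auto t1 = std::chrono::high_resolution_clock::now();
    std::memset(p, 0, s);      // first touch: page faults + zeroing
    auto t2 = std::chrono::high_resolution_clock::now();
    std::memset(p, 0xFF, s);   // second pass: pages already mapped
    auto t3 = std::chrono::high_resolution_clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    // Print a byte so the writes are observable and not optimized away.
    std::printf("alloc %.3f ms, first touch %.3f ms, rewrite %.3f ms, p[0]=%d\n",
                ms(t0, t1), ms(t1, t2), ms(t2, t3), p[0]);
    delete[] p;
}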

This was done on my desktop system (5 GHz i9, 3600 MHz DDR4, Linux Mint 19, 4.15.0-45-generic kernel) with Clang 7 (-O2 -march=native); looking at the assembly, the compiler is not the problem.

EDIT: This is a simplified example; in my actual application I need to initialize the memory with a value other than 0, but that doesn't change the timing at all.

Artificial Mind
  • Are you sure you are not a *victim of premature optimization*? Also, parallelizing a loop and getting better performance, when the situation is right, is expected, not funny. :) PS: The title is misleading... – gsamaras Feb 09 '19 at 17:37
  • Yep, I'm profiling my application, and this is the next hotspot to optimize. My complete program could get about 30% faster if that allocation performance were closer to the memset. The reason I called this funny is that this program should clearly be memory-limited, not compute-limited. Do you have a better title? I have newly allocated memory and want to initialize it as fast as possible. – Artificial Mind Feb 09 '19 at 17:42
  • You can do it on Windows and Linux, I think, directly using a system call. On Linux, mmap with MAP_POPULATE pre-faults the pages; the pages are always zeroed (or not zeroed for kernel configurations that target embedded devices). (See the MAP_POPULATE sketch after these comments.) – Oliv Feb 09 '19 at 17:48
  • Tell the compiler to print the assembly language for all the cases. Some processors may have specialized instructions for setting memory blocks to a single value. Check your processor's assembly language and verify that the compiler is optimizing accordingly. – Thomas Matthews Feb 09 '19 at 17:48
  • You may want to see if your processor supports parallel execution. This is one of those cases that can use parallel instructions. Check your compiler documentation for instructions on how to specify a loop in parallel. – Thomas Matthews Feb 09 '19 at 17:50
  • I checked the assembly (https://godbolt.org/z/fihC9j), looks good so far. It really is an issue of the first access to the new memory. When I use the same loop a second time it's 3 ms instead of 20. – Artificial Mind Feb 09 '19 at 17:51
  • The kernel must find physical memory and map it into your process one way or another. It doesn't make much difference whether it does this at allocation time or at first-access time. – n. m. could be an AI Feb 09 '19 at 17:52
  • Also notice that on Linux (probably also on Windows), for large chunks of memory (>16*4096 B), malloc directly calls mmap, and as I said, mmap provides zeroed memory! There is no point in trying to zero already-zeroed memory. – Oliv Feb 09 '19 at 17:53
  • If you are setting memory to 0, you may want to investigate using `memset`. The `memset` function should be optimized for setting memory to a specific value (although it may have extra code to handle boundary conditions). – Thomas Matthews Feb 09 '19 at 17:53
  • Edited the post: 0 was a simplification; I use different values in my actual application, though it doesn't matter for the performance. @ThomasMatthews I tested `memset` (see also the godbolt link) and it doesn't make a difference. – Artificial Mind Feb 09 '19 at 17:59
  • It may be faster if you interpret the array as an array of long and set the long values to zero. – jjj Feb 09 '19 at 18:09
  • @jjj already tried that and it didn't help. `memset` should be the fastest way to set memory anyway. – Artificial Mind Feb 09 '19 at 18:18
  • `memset` may not be the fastest way, since each page fault makes the system zero the memory, and then your code sets it to a different value. This is a waste. On Linux there is a way to bypass that useless zeroing step. – Oliv Feb 09 '19 at 18:20
  • I don't know how you did your testing, but I recommend taking a checksum of the buffer and printing it out after the test, to make sure the compiler doesn't optimize the entire test away if it detects that you don't use the buffer. – Galik Feb 09 '19 at 18:51
  • Your data seems to indicate that all methods cost 20 ms for the first genuine access to the memory, and 3 ms thereafter. What leads you to believe that there is a secret way to avoid the page management overhead? – Ben Voigt Feb 09 '19 at 19:45
  • @BenVoigt because multithreaded access is faster. And because there might be OS APIs for directly allocating pages. – Artificial Mind Feb 09 '19 at 20:40
  • There are OS APIs but you tagged for portable C++ and not for any specific OS. If you want to know about non-portable approaches, you'd benefit from saying that in your question. – Ben Voigt Feb 09 '19 at 20:43
  • You could read one byte every page-size bytes (read one byte, skip ahead 4096 bytes, read one byte, etc.). Your multithreaded version is faster because one or more threads are clearing memory while other threads are stalled waiting on the paging operation (so the clear is mostly free). (A sketch of this pre-faulting idea follows below.) – 1201ProgramAlarm Feb 09 '19 at 21:19
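
Following up on Oliv's mmap suggestion, a minimal Linux-only sketch using MAP_POPULATE to pre-fault all pages at allocation time. The flag itself is documented in mmap(2); the surrounding harness is an illustrative assumption, not code from the question:

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    constexpr std::size_t s = 100 * 1024 * 1024;

    // MAP_POPULATE asks the kernel to pre-fault all pages up front,
    // so the first write no longer pays the page-fault cost.
    void* mem = mmap(nullptr, s, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (mem == MAP_FAILED) {
        std::perror("mmap");
        return 1;
    }

    char* p = static_cast<char*>(mem);
    p[0] = 1;  // should not page-fault: memory is already mapped (and zeroed)

    munmap(mem, s);
}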

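And a sketch of the per-page touching idea from 1201ProgramAlarm's comment. Note that it writes rather than just reads one byte per page: a read of untouched anonymous memory may only map a shared zero page, so the write fault would still be paid later. The sysconf page-size query is POSIX-specific; everything else is an assumption for illustration:

#include <unistd.h>
#include <cstddef>
#include <cstdio>

int main() {
    constexpr std::size_t s = 100 * 1024 * 1024;
    // Query the real page size instead of hard-coding 4096.
    const auto page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));

    auto p = new char[s];

    // Write one byte in every page so all page faults are paid here,
    // up front. This loop could also be split across threads, as the
    // OpenMP measurement in the question suggests.
    for (std::size_t i = 0; i < s; i += page)
        p[i] = 0;

    std::printf("pre-faulted %zu pages, p[0]=%d\n", (s + page - 1) / page, p[0]);
    delete[] p;
}
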
0 Answers