
I have a memory-heavy application which is supposed to run with low latency and at a constant speed, but in practice it performs poorly during the first few seconds after startup. This appears to be because the initial memory accesses trigger page faults, which have significant performance implications.

I would like to try preallocating a single large block of memory, paging it all in (via mlock() or just by touching each byte), and then using a custom malloc()/free() implementation to ensure that all further allocations are done from within this block.
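
Roughly what I have in mind for the priming step is sketched below. This is only a sketch: the mmap()/mlock() usage assumes a Linux/POSIX system, and `prime_backing_store` is just a name I made up here.

```cpp
// Sketch only: allocate a large anonymous mapping, lock it, and touch every
// page so the faults happen now rather than during the real work.
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

void* prime_backing_store(std::size_t bytes)
{
    void* block = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED) {
        std::perror("mmap");
        return nullptr;
    }

    // mlock() faults the pages in and pins them; it can fail if
    // RLIMIT_MEMLOCK is too small, so fall back to touching each page.
    if (mlock(block, bytes) != 0) {
        const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
        volatile char* p = static_cast<volatile char*>(block);
        for (std::size_t i = 0; i < bytes; i += page)
            p[i] = 0;
    }

    return block; // hand this to the custom malloc()/free() as its arena
}
```

(On Linux, passing MAP_POPULATE to mmap() should pre-fault the mapping in much the same way.)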

I am aware of numerous custom memory allocators (TCMalloc, Hoard, jemalloc, etc.), but it is not clear to me whether they can be backed by user-provided memory or whether they always obtain their memory from the OS. Does anyone have any insight or recommendations here?

To be clear, I am not looking for a memory pooling system (which would be for reusing small objects). The custom implementation of malloc()/free() should be able to perform any size allocation while limiting fragmentation of its backing store and following other best practices.

Edit based on comments: I do not expect to make the system faster - I just want to move the slow part (allocation, initial page faults) to the start of the process, and then do the real computation work once the system is 'primed'.

Thanks!

David Williams
  • "First few seconds" is a rather short time, especially if the total runtime of your program stretches over several minutes, hours, or even days. Are these "few seconds" really such a big problem for your larger system? Also, even if you allocate a large memory area and "touch" bytes in each page, you will get several page faults to actually create and map the pages to your process, so the performance gain might not be as big as you expect. – Some programmer dude Jul 15 '22 at 08:14
  • Did you already consider [`std::pmr::monotonic_buffer_resource`](https://en.cppreference.com/w/cpp/memory/monotonic_buffer_resource)? – paolo Jul 15 '22 at 08:15
  • I'm not sure how you expect this to help. If the OS needs time to zero out pages, nothing you do can help with that. And eventually, all memory comes from the OS. You should profile the problem first to find the source of the delays. Finally, "fragmentation of its backing store"? That does not sound like a real problem to me. – MSalters Jul 15 '22 at 08:15
  • @Someprogrammerdude It depends on the system, but it can impact performance for the first 20-30 seconds if there are a lot of threads (each thread allocates memory once it gets assigned some work). One of the main challenges is that this makes calibrating the system difficult, because for calibration purposes I would ideally run for a relatively short time. So I'd like to get the system 'primed' and as ready as it can be before I start the actual processing. – David Williams Jul 15 '22 at 08:25
  • @MSalters No, I don't expect to make it faster overall. I would just like to move the slow operations (allocation, paging) to the start of the process, so that they do not affect performance once I start the real work. As for fragmentation, I believe it is a real problem that memory management systems need to address (though not something which the user is typically exposed to). I'm not well-versed there though. – David Williams Jul 15 '22 at 08:39
  • @paolo I believe that is more related to controlling the deallocation of memory rather than helping with the initial allocation and paging overheads. – David Williams Jul 15 '22 at 08:43

1 Answer


A bit late to the party.

dlmalloc is one choice that can be backed by pre-allocated memory. You can find it here. You may just need to define a few extra options to force it to use your pre-allocated memory rather than calling the system mmap; the documentation at the beginning of the file covers this nicely.
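
To make that concrete, below is a rough sketch of the kind of glue involved: a custom sbrk-style routine that hands out memory from your pre-allocated block, with dlmalloc's mmap path disabled. Treat the option names (HAVE_MMAP, HAVE_MORECORE, MORECORE, MORECORE_CANNOT_TRIM) and the failure convention as things to verify against the documentation in malloc.c, not as a tested configuration.

```cpp
// Illustration only: carve dlmalloc's "core" requests out of a block that was
// allocated and paged in at startup. Compile malloc.c with something like
//   -DHAVE_MMAP=0 -DHAVE_MORECORE=1 -DMORECORE=pool_morecore -DMORECORE_CANNOT_TRIM=1
// (check these names against the comments at the top of malloc.c).
#include <cstddef>
#include <cstdint>

static char*       g_pool      = nullptr; // point this at the primed block at startup
static std::size_t g_pool_size = 0;
static std::size_t g_pool_used = 0;

extern "C" void* pool_morecore(std::intptr_t increment)
{
    if (increment <= 0)
        return g_pool + g_pool_used;        // report the current "break"

    if (g_pool_used + static_cast<std::size_t>(increment) > g_pool_size)
        return reinterpret_cast<void*>(-1); // sbrk-style failure value

    void* chunk = g_pool + g_pool_used;
    g_pool_used += static_cast<std::size_t>(increment);
    return chunk;
}
```

With the mmap path disabled, any memory dlmalloc cannot satisfy from its existing free chunks has to come through this routine, so nothing should escape the pre-allocated block.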

mewais
  • Thanks for the answer. I've moved on to the next project now so I won't get a chance to test this, but I'll accept it because it seems reasonable and is the only answer I got. – David Williams Oct 12 '22 at 12:20
  • For the occasional googler, [here](https://github.com/s5z/zsim/blob/master/src/g_heap/dlmalloc.h.c)'s also an example of it being used. In this instance, the user allocates a huge shared memory heap, then uses dlmalloc to allocate from it as needed by their application. – mewais Oct 13 '22 at 04:30