
How can I allocate memory on Linux without overcommitting, so that malloc actually returns NULL if no memory is available and the process doesn't randomly crash on access?

My understanding of how malloc works:

  1. The allocator checks its freelist for free memory. If there is some, the memory is allocated from it.
  2. If not, new pages are requested from the kernel. This is where overcommit can happen. Then the new memory is returned.

So if there were a way to get memory from the kernel that is immediately backed by physical memory, the allocator could use that instead of getting overcommitted pages, and return NULL if the kernel refuses to give more memory.

Is there a way this can be done?

Update:

I understand that this cannot fully protect the process from the OOM killer because it will still be killed in an out of memory situation if it has a bad score, but that is not what I'm worried about.

Update 2: Nominal Animal's comment gave me the following idea of using mlock:

#include <stdlib.h>
#include <sys/mman.h>

void *malloc_without_overcommit(size_t size) {
    void *pointer = malloc(size);
    if (pointer == NULL) {
        return NULL;
    }
    /* Locking forces the kernel to back the range with RAM right now,
     * so a failure here means memory is not actually available. */
    if (mlock(pointer, size) != 0) {
        free(pointer);
        return NULL;
    }

    return pointer;
}

But this is probably quite slow because of all the system calls, so it should really be done at the level of the allocator implementation. It also prevents making use of swap, and, since mlock and munlock operate on whole pages, the allocations should be page-aligned (see the sketch below).
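Here is a minimal sketch of such a page-granular wrapper, following Andrew Henle's point in the comments that mlock()/munlock() work on entire pages. alloc_locked and free_locked are just names I made up, and the caller has to remember the allocation size:

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round up to whole pages so mlock/munlock never touch pages
 * shared with other allocations. */
static size_t round_to_pages(size_t size) {
    size_t page = (size_t)getpagesize();
    return (size + page - 1) & ~(page - 1);
}

void *alloc_locked(size_t size) {
    void *pointer;
    size_t rounded = round_to_pages(size);
    /* Page-aligned allocation, so this block owns its pages exclusively. */
    if (posix_memalign(&pointer, (size_t)getpagesize(), rounded) != 0) {
        return NULL;
    }
    if (mlock(pointer, rounded) != 0) {
        free(pointer);
        return NULL;
    }
    return pointer;
}

void free_locked(void *pointer, size_t size) {
    munlock(pointer, round_to_pages(size));
    free(pointer);
}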

Update 3:

New idea, following John Bollinger's comments:

  1. Check whether enough memory is available. From what I understand, this has to be read from the MemFree and SwapFree values in /proc/meminfo.
  2. Only if enough space is available (plus an additional safety margin), allocate the memory.
  3. Find out the page size with getpagesize and write one byte to the memory every page size, so that it gets backed by physical memory (either RAM or swap). A sketch of these steps follows below.
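A minimal sketch of the three steps (meminfo_kb and malloc_checked are just names I made up; note that the check is inherently racy, since any other process can allocate between the check and the writes, as John Bollinger points out in the comments):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Step 1 helper: read a field (reported in kB) from /proc/meminfo. */
static long meminfo_kb(const char *field) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL)
        return -1;
    char line[256];
    long value = -1;
    size_t len = strlen(field);
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, field, len) == 0 && line[len] == ':') {
            value = strtol(line + len + 1, NULL, 10); /* value in kB */
            break;
        }
    }
    fclose(f);
    return value;
}

/* Steps 1-3: check MemFree + SwapFree, allocate, touch every page. */
void *malloc_checked(size_t size, size_t margin_bytes) {
    long free_kb = meminfo_kb("MemFree");
    long swap_kb = meminfo_kb("SwapFree");
    if (free_kb < 0 || swap_kb < 0)
        return NULL;
    if ((size + margin_bytes) / 1024 > (size_t)free_kb + (size_t)swap_kb)
        return NULL;

    unsigned char *pointer = malloc(size);
    if (pointer == NULL)
        return NULL;

    int page = getpagesize();
    for (size_t i = 0; i < size; i += (size_t)page)
        pointer[i] = 0; /* fault the page in so it gets backing */
    return pointer;
}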

I also looked more closely at mmap(2) and found the following:

MAP_NORESERVE

Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available. See also the discussion of the file /proc/sys/vm/overcommit_memory in proc(5). In kernels before 2.6, this flag only had effect for private writable mappings.

Does this imply that mmapping without MAP_NORESERVE will completely protect the process from the OOM killer? If so, this would be the perfect solution, as long as there is a malloc implementation that can work directly on top of mmap (maybe jemalloc?).

Update 4: My current understanding is that mapping without MAP_NORESERVE will not protect against the OOM killer, but at least against segfaulting on the first write to the memory.
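To illustrate what I mean, a sketch of an anonymous mapping without MAP_NORESERVE (whether the write guarantee actually holds also depends on the system-wide /proc/sys/vm/overcommit_memory setting):

#include <sys/mman.h>

/* Anonymous mapping *without* MAP_NORESERVE: the kernel accounts
 * swap space for it up front, so mmap() itself fails with ENOMEM
 * instead of a later write raising SIGSEGV for lack of backing. */
void *reserve_mapping(size_t size) {
    void *pointer = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (pointer == MAP_FAILED) ? NULL : pointer;
}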

  • No, overcommit is not as simple as that. Overcommit is not a property of the memory, but a policy on how to manage the mapping between virtual memory and actual RAM available. With overcommit, there is more virtual memory than RAM. Without, virtual memory is limited to total RAM. The only way to ensure some allocation is always backed by RAM is to lock the pages in memory ([`mlock()`](http://man7.org/linux/man-pages/man2/mlock.2.html)). – Nominal Animal Feb 02 '18 at 14:55
  • @NominalAnimal mlock is a good clue, although there is one problem: mlock doesn't allow swapping pages to a hard drive. – FSMaxB Feb 02 '18 at 14:59
  • 3
    @NominalAnimal *Without [overcommit], virtual memory is limited to total RAM.* Available swap space adds to available virtual memory, too. – Andrew Henle Feb 02 '18 at 15:06
  • 1
    `mlock(pointer, size)` probably isn't workable - `mlock()` will lock the page(s), and you're still using `malloc()`. You'd have to also try to somehow keep track of what pages needed to be unlocked, because `munlock()` also operates on entire pages. – Andrew Henle Feb 02 '18 at 15:09
  • @AndrewHenle But won't free actually just give the entire page back, so that munlock won't matter? (once the page is not used anymore of course) – FSMaxB Feb 02 '18 at 15:12
  • One workable solution would be to 1) use an entirely `brk()`/`sbrk()`-based `malloc()` library and 2) interpose on `brk()` and `sbrk()` using LD_PRELOAD to create the heap's backing virtual RAM and lock it into place. I did that for a customer about a decade ago on a Solaris system where I used intimate shared memory (`SHM_SHARE_MMU`) to create a heap that couldn't be swapped out. – Andrew Henle Feb 02 '18 at 15:16
  • 1
    @FSMaxB `free()` doesn't have to "give back" anything. Once the heap memory is allocated to your process, your process in general keeps it forever. The standard heap routines on Linux do use a mixed-mode allocator under the hood, though, where larger allocations may be satisfied with dedicated `mmap()` calls, while smaller ones may use `sbrk()`/`brk()`-obtained RAM or `mmap()` memory. Linux's mixed-mode allocator does make solving your particular issue more difficult. – Andrew Henle Feb 02 '18 at 15:19
  • I often use a file-backed memory map (mapped using `MAP_SHARED | MAP_NORESERVE`) which causes the memory map to be backed only by the file, not swap, for datasets larger than available memory (RAM+SWAP). Avoids the entire overcommit issue, too. (@AndrewHenle: You're right; I tried to simplify things a bit too much in my comment, as I tried to emphasize that overcommit is a *policy*, not a flag or property of some memory, as I think that distinction is extremely important to understand here.) – Nominal Animal Feb 02 '18 at 15:19
  • 2
    If possible, you could just disable overcommit for the whole system by setting the sysctl [`vm.overcommit_memory`](https://www.kernel.org/doc/Documentation/vm/overcommit-accounting) to 2. – Cristian Ciupitu Feb 02 '18 at 15:25
  • I didn't know about these mixed allocations; I kind of assumed that it is just mmapping pages for everything. – FSMaxB Feb 02 '18 at 15:29
  • I'm not confident that I can implement this myself on such a low level. And I explicitly don't want to turn off overcommiting in the entire system. – FSMaxB Feb 02 '18 at 15:30
  • 2
    *I explicitly don't want to turn off overcommiting in the entire system.* -- then what's the point? Memory overcommit is a whole-system issue. You cannot usefully avoid it on a per-process basis, because even if your process's allocation succeeds without ovecommit, the next allocation *by any process* may put the system into an overcommit state, affecting your process as much as any other. – John Bollinger Feb 02 '18 at 15:43
  • Maybe you want something different, such as avoiding your process allocating more than x% of the system's total RAM, or y% of the RAM + swap. If so, that's not an overcommit problem. – John Bollinger Feb 02 '18 at 15:46
  • @JohnBollinger Yeah, maybe it is better to first check if enough physical memory space is available (both RAM + swap), then allocate and write to all the allocated pages so that they are actually backed by RAM or swap. – FSMaxB Feb 02 '18 at 15:52
  • Yes, @FSMaxB, I was prepared to suggest more or less that. It won't necessarily get you an error from `malloc()`, but it will either trigger the OOM Killer or ensure that the system backs all your allocated memory with RAM and / or swap, and it will do so without locking anything into RAM. – John Bollinger Feb 02 '18 at 15:58
  • @JohnBollinger *Memory overcommit is a whole-system issue.* Yes, that's what I meant when I said my process may still be killed by the OOM killer and that not being my concern. I just don't want to be the one that accidentally triggers the OOM killer. (I still might be killed by it, but only due to other processes' behavior.) – FSMaxB Feb 02 '18 at 16:02
  • @FSMaxB, if memory overcommit is enabled, then you cannot effectively protect against your process triggering the OOM Killer. Even if you don't allocate anything at all, it is possible for the attempt to load and run your program to trigger it. – John Bollinger Feb 02 '18 at 16:09

2 Answers


How can I allocate memory on Linux without overcommitting

That is a loaded question, or at least an incorrect one. The question is based on an incorrect assumption, which makes answering the stated question irrelevant at best, misleading at worst.

Memory overcommitment is a system-wide policy -- because it determines how much virtual memory is made available to processes -- and not something a process can decide for itself.

It is up to the system administrator to determine whether memory is overcommitted or not. In Linux, the policy is quite tunable (see e.g. /proc/sys/vm/overcommit_memory in man 5 proc). There is nothing a process can do during allocation that would affect the memory overcommit policy.
 

OP also seems interested in making their processes immune to the out-of-memory killer (OOM killer) in Linux. (The OOM killer in Linux is a mechanism used to relieve memory pressure by killing processes, thus releasing their resources back to the system.)

This too is an incorrect approach, because the OOM killer is a heuristic process, whose purpose is not to "punish or kill badly behaving processes", but to keep the system operational. This facility is also quite tunable in Linux, and the system admin can even tune the likelihood of each process being killed in high memory pressure situations. Other than the amount of memory used by a process, it is not up to the process to affect whether the OOM killer will kill it during out-of-memory situations; it too is a policy issue managed by the system administrator, and not the processes themselves.
 

I assume that the actual question the OP is trying to solve is how to write Linux applications or services that can dynamically respond to memory pressure, other than just dying (due to SIGSEGV or the OOM killer). The answer to this is you do not -- you let the system administrator worry about what is important to them in the workload they have instead -- unless your application or service is one that uses lots and lots of memory, and is therefore likely to be unfairly killed during high memory pressure. (Especially if the dataset is sufficiently large to require enabling a much larger amount of swap than would otherwise be enabled, causing a higher risk of a swap storm and a late-but-too-strong OOM killer.)

The solution, or at least the approach that works, is to memory-lock the critical parts (or even the entire application/service, if it works on sensitive data that should not be swapped to disk), or to use a memory map with a dedicated backing file. (For the latter, here is an example I wrote in 2011, that manipulates a terabyte-sized data set.)
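For the file-backed approach, here is a minimal sketch (map_file_backed is just an illustrative name; the flags follow the MAP_SHARED | MAP_NORESERVE pattern mentioned in the comments above):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a data set backed by a dedicated file rather than swap.
 * Dirty pages are written back to the file under memory pressure,
 * so the mapping sidesteps the swap/commit accounting entirely. */
void *map_file_backed(const char *path, size_t size) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return NULL;
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return NULL;
    }
    void *pointer = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_NORESERVE, fd, 0);
    close(fd); /* the mapping holds its own reference to the file */
    return (pointer == MAP_FAILED) ? NULL : pointer;
}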

The OOM killer can still kill the process, and a SIGSEGV can still occur (due to, say, an internal allocation by a library function that the kernel fails to provide RAM backing for), unless all of the application is locked to RAM, but at least the service/process is no longer unfairly targeted just because it uses lots of memory.

It is possible to catch the SIGSEGV signal (which occurs when there is no memory available to back the virtual memory), but thus far I have not seen a use case that would warrant the code complexity and maintenance effort required.
 

In summary, the proper answer to the stated question is no, don't do that.

Nominal Animal

From the discussion in the comments, it appears calling

mlockall( MCL_CURRENT | MCL_FUTURE );

upon process start would satisfy the requirement for malloc() to return NULL when the system cannot actually provide memory.

Per the Linux mlockall() man page:

mlockall() and munlockall()

mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.

The flags argument is constructed as the bitwise OR of one or more of the following constants:

   MCL_CURRENT Lock all pages which are currently mapped into the
               address space of the process.

   MCL_FUTURE  Lock all pages which will become mapped into the address
               space of the process in the future.  These could be, for
               instance, new pages required by a growing heap and stack
               as well as new memory-mapped files or shared memory
               regions.

   MCL_ONFAULT (since Linux 4.4)
               Used together with MCL_CURRENT, MCL_FUTURE, or both.
               Mark all current (with MCL_CURRENT) or future (with
               MCL_FUTURE) mappings to lock pages when they are faulted
               in.  When used with MCL_CURRENT, all present pages are
               locked, but mlockall() will not fault in non-present
               pages.  When used with MCL_FUTURE, all future mappings
               will be marked to lock pages when they are faulted in,
               but they will not be populated by the lock when the
               mapping is created.  MCL_ONFAULT must be used with either
               MCL_CURRENT or MCL_FUTURE or both.

If MCL_FUTURE has been specified, then a later system call (e.g., mmap(2), sbrk(2), malloc(3)), may fail if it would cause the number of locked bytes to exceed the permitted maximum (see below). In the same circumstances, stack growth may likewise fail: the kernel will deny stack expansion and deliver a SIGSEGV signal to the process.

Note that using mlockall() in this manner might have other, unexpected consequences. Linux has been developed assuming memory overcommit is available, so something as simple as calling fork() after mlockall() might run into issues.
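As a minimal sketch, assuming the program can tolerate everything being locked into RAM (the getrlimit() call reports the "permitted maximum" from the man page excerpt, which is RLIMIT_MEMLOCK and is often small by default for unprivileged processes):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void) {
    /* The "permitted maximum" of locked bytes is RLIMIT_MEMLOCK;
     * report it so failures below are easier to diagnose. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        fprintf(stderr, "RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)rl.rlim_max);

    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }

    /* From here on, malloc() fails instead of overcommitting, and
     * stack growth past the limit delivers SIGSEGV. */
    return 0;
}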

Andrew Henle
  • But this seems to be tied to RAM again. This is major overkill, if all the OP wants is to avoid overcommitting. Overcommitting is not about RAM (as you correctly said yourself in the comments). – AnT stands with Russia Feb 02 '18 at 15:46
  • Hm, reading the parts of the man page makes me wonder what that limit on allowed locking typically is. Maybe mlock is not the best approach after all. – FSMaxB Feb 02 '18 at 15:50
  • @AnT *But this seems to be tied to RAM again.* Of course it's tied to RAM. OP's requirement is that `malloc()` return `NULL` when there's no RAM actually available. – Andrew Henle Feb 02 '18 at 16:14
  • @FSMaxB *what that limit on allowed locking typically is* Offhand, I'd guess it's a tunable parameter that defaults to something like 50% of RAM. – Andrew Henle Feb 02 '18 at 16:15
  • @Andrew Henle: But where does he say that? Firstly, the original question is clearly about overcommitting and overcommitting only. Secondly, if they are working in a RAM-only setup then the matter of locking is moot. If they are working in a swap-enabled setup then there's no connection between `malloc` and "RAM available" at all. And in the comments the OP is clearly talking about RAM+swap. – AnT stands with Russia Feb 02 '18 at 16:28
  • @AnT *But where does he say that?* In the first sentence of his post: "... so that malloc actually returns NULL if no memory is available" That's the actual concrete requirement stated, and as far as I can tell `mlockall()` appears to be about the only way to ensure a single process actually gets *and keeps* the RAM it asks for on a server that permits memory overcommit. Yes, it's likely to also have unintended consequences on an OS designed from the ground up to overcommit memory. – Andrew Henle Feb 02 '18 at 16:35
  • @Andrew Henle: In the context of [C], [linux] and `malloc` "memory" is not "RAM". The post is about the problem of overcommitting. None of this has any direct relation to RAM, unless the OP is talking about RAM-only setups. (Ans they are not.) – AnT stands with Russia Feb 02 '18 at 16:51
  • @AnT *"memory" is not "RAM"* Then what is it, exactly? – Andrew Henle Feb 02 '18 at 17:09
  • @Andrew Henle: The C concept of "storage" (what `malloc` allocates) on real-life desktop platforms is backed by so-called "virtual memory". The issue of "overcommitting memory" is an issue with some implementations of the virtual memory mechanism specifically. Virtual memory is not RAM. – AnT stands with Russia Feb 02 '18 at 17:56