
Can someone in the know please explain how lazy-backed heap storage interacts with the memory-zeroing guarantees of calloc/realloc? Specifically, I would like to know:

  1. if/when the zero writes would cause storage to be faulted in immediately
  2. if/when not, should I be concerned about the context where the faulting-in may take place (e.g., a read syscall done from assembly; a minimal sketch of the scenario follows below)
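For concreteness, here's a minimal sketch of the scenario in C (in the real code the `read` syscall is made from hand-written assembly):

```c
/* Minimal sketch of the scenario in question: a large calloc'd buffer
 * handed straight to read(2) before user-space has touched it.
 * (The real use case issues the syscall from assembly; C is used here
 * for clarity.) */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    size_t len = 64 * 1024 * 1024;        /* big enough that calloc gets fresh pages from the OS */
    unsigned char *buf = calloc(len, 1);  /* lazily backed: any physical pages yet? */
    if (!buf)
        return 1;

    /* Does calloc's zeroing fault the storage in immediately, or does
     * the faulting happen here, inside the kernel's read path? */
    ssize_t n = read(STDIN_FILENO, buf, len);
    printf("read %zd bytes\n", n);
    free(buf);
    return 0;
}
```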
l.k
  • I see them as completely independent features. Overcommit is something done by the OS so that when the program requires more memory it is already mapped. Zeroing is done by the C runtime, and only on the memory requested. You never have to deal with faults when using memory that you requested and that was correctly allocated by the OS (how it swaps it in and out is an internal detail). – Margaret Bloom Apr 24 '21 at 08:31
  • @MargaretBloom: Right, if the kernel runs out of RAM + swap to satisfy all the allocations it has already allowed, it will kill a process that's using a lot of memory (the OOM killer has some heuristics to pick a PID). So yeah, processes never have to handle segfaults (SIGSEGV), and soft/hard page faults are always transparent, except when they result in the whole process being killed because it was the memory hog! – Peter Cordes Apr 24 '21 at 08:44

1 Answer


`calloc` can get guaranteed-zero pages from the OS, and thus avoid having to write zeros in user-space at all. (That's especially likely for large allocations; otherwise it will reuse and zero memory from the free list, if there are free-list entries of the right size.) That's where the laziness comes in.
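As an illustration, here's a simplified sketch of that decision. (This is not glibc's actual code; `toy_calloc` is a made-up name, and the 128 KiB cutoff matches glibc's default `M_MMAP_THRESHOLD`.)

```c
/* Simplified sketch of why calloc can often skip zeroing in user-space:
 * fresh anonymous pages from the kernel are guaranteed to read as zeros,
 * so only recycled memory needs an explicit memset. Not glibc's code. */
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define MMAP_THRESHOLD (128 * 1024)  /* glibc's default M_MMAP_THRESHOLD */

void *toy_calloc(size_t nmemb, size_t size) {
    size_t len;
    if (__builtin_mul_overflow(nmemb, size, &len))  /* calloc must reject overflow */
        return NULL;

    if (len >= MMAP_THRESHOLD) {
        /* Large request: kernel-guaranteed zero pages, lazily allocated.
         * No memset needed, and nothing is faulted in yet. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* Small request: recycled free-list memory may hold stale data,
     * so it has to be zeroed by hand. (malloc stands in for the
     * allocator's free-list lookup here.) */
    void *p = malloc(len);
    if (p)
        memset(p, 0, len);
    return p;
}
```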

So your page will be fresh from `mmap(MAP_ANONYMOUS)`, untouched by user-space. Reading it will trigger a soft page fault that copy-on-write maps it to a shared physical page of zeros. (Fun fact: you can get TLB misses but L1d / L2 cache hits when looping read-only over a huge `calloc` allocation.)

Writing that page / one of those pages (whether as the first access, or after it's been CoW-mapped to the zero page) will soft page-fault, and Linux's page-fault handler will allocate a new physical page and zero it. (So after the page fault, the whole page is generally hot in L1d cache, or at least L2, even with fault-around, where the kernel also prepares neighbouring lazily-allocated pages and wires them into the page table to reduce the number of page faults.)
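Both fault flavours are easy to observe. Here's a small Linux-only demo using `getrusage`'s `ru_minflt` soft-fault counter (exact counts vary with fault-around and transparent huge pages):

```c
/* Count soft page faults while touching a big calloc allocation:
 * the read-only pass faults pages in as CoW mappings of the shared
 * zero page, and the write pass faults again to get private, freshly
 * zeroed physical pages. Linux-specific; counts are approximate. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    size_t len = 64 * 1024 * 1024, page = 4096;
    volatile unsigned char *buf = calloc(len, 1);
    if (!buf)
        return 1;

    long base = minor_faults();
    unsigned sum = 0;
    for (size_t i = 0; i < len; i += page)  /* read-only: soft faults to the zero page */
        sum += buf[i];
    long after_read = minor_faults();

    for (size_t i = 0; i < len; i += page)  /* writes: CoW faults allocate + zero real pages */
        buf[i] = 1;
    long after_write = minor_faults();

    printf("sum=%u, read pass: %ld soft faults, write pass: %ld more\n",
           sum, after_read - base, after_write - after_read);
    free((void *)buf);
    return 0;
}
```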


But no, you don't generally need to worry about it, other than general performance tuning. If you logically own some memory, you can ask `read` to put data into it. The libc wrapper isn't doing any special retrying there; all the magic (checking for the target page being present and treating it like a soft or hard page fault) happens inside the kernel's implementation of `read`, as part of `copy_to_user`.

(Basically a `memcpy` from kernel memory to user-space, with permission checking that can make it return `-EFAULT` if you pass the kernel a pointer that you don't even logically own, i.e. memory that would segfault if you touched it from user-space. Note that you don't get a SIGSEGV from `read(0, NULL, 1)`, just an error. Use `strace ./a.out` to see, as an alternative to actually implementing error checking in your hand-written asm.)
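If you'd rather see that from C than from strace output, here's a quick check (expected behaviour assumes Linux):

```c
/* read() into memory the process doesn't own fails with an error
 * return; it does not raise SIGSEGV. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    ssize_t n = read(0, NULL, 1);  /* NULL is not mapped in user-space */
    printf("read returned %zd, errno=%d (%s)\n", n, errno, strerror(errno));
    /* Expected on Linux: read returned -1, errno=14 (Bad address),
     * i.e. the kernel's -EFAULT surfaced through the libc wrapper. */
    return 0;
}
```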

Peter Cordes
  • "fun fact, you can get TLB misses but L1d / L2 cache hits when looping read-only over a huge calloc allocation" ... awkward. "If you logically own some memory, you can ask `read` to put data into it [...] all the magic [...] happens inside the kernel's implementation of `read` [...]" Well, yes. So we're already bouncing in and out of the page-fault handler / calling GFP from kernel context. So... what's a few `rep stos` between friends, I guess? Good to know it's nbd. – l.k Apr 26 '21 at 07:25
  • (that being said, I'm ok with `test eax,eax; js .error` :p) – l.k Apr 26 '21 at 07:35
  • @l.k: I don't know if the kernel has any shortcuts to avoid actually zeroing the memory if `read` is going to overwrite a whole page that was previously only lazily allocated. It plausibly could, but would have to make sure `read` was definitely going to overwrite the whole page, or you'd have to zero the remaining bytes to avoid leaking data. (And all of this before wiring it into the page table, otherwise another thread of the same process could read stale data from that page.) – Peter Cordes Apr 26 '21 at 07:37
  • @l.k: I think it *does* avoid actually triggering a `#PF` page-fault exception inside the kernel, though, by checking for page presence before using `rep movsb`. Perhaps I'm wrong; it's been a while since I read that. And unfortunately it has to check every time, unlike relying on HW, which makes it zero-cost when the page is present. And yeah, for `read` it should be safe to just check the sign bit of the return value, and even only the low 32 bits if you never make a `read` call larger than 4GiB. In general though, `cmp rax, -4095` / `jae error` works for literally every system call. – Peter Cordes Apr 26 '21 at 07:39
  • I would imagine that optimization would add too much complexity for the value, since it should only fire if there isn't a pre-zero'd page available, no? – l.k Apr 26 '21 at 07:40
  • @l.k: Linux doesn't pre-zero pages. Zeroing on the fly is so cheap on modern CPUs, and doing it any other time would pollute cache with those zeros. (Unless you used `movnti` cache-bypassing stores, but that would make it take even more CPU time than `rep stos`). But yeah, I think the reasoning is that it's not worth the bookkeeping, or trying to schedule the work. Priming the TLB and data cache for the page right before user-space uses it amortizes some of the cost of doing it; if it had been zeroed a while ago, user-space would TLB miss and cache miss. – Peter Cordes Apr 26 '21 at 07:43
  • @l.k: BTW, reading files that are hot in pagecache is often best done with `mmap` on modern CPUs. You just ask the kernel to map those pages into your address space, and no kernel-side copying happens, so no `copy_to_user`. Without `MAP_POPULATE`, you can still get soft page faults from lazy mapping (but with fault-around it's OK). But `MAP_POPULATE` on the whole file at once is bad with large files: you don't get to overlap computation with kernel readahead if the file *isn't* hot in pagecache. `madvise(MADV_SEQUENTIAL)` can help, IIRC. – Peter Cordes Apr 26 '21 at 07:47
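As an aside, here's a minimal sketch of that `mmap` approach (assuming Linux; `MADV_SEQUENTIAL` is only a hint, and error handling is kept minimal):

```c
/* Read a file by mapping its page-cache pages directly instead of
 * copying through read()/copy_to_user. Pages still fault in lazily
 * unless MAP_POPULATE is used. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    madvise(p, st.st_size, MADV_SEQUENTIAL);  /* hint the kernel to read ahead aggressively */

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)  /* no kernel-side copy; soft faults map cached pages */
        sum += p[i];
    printf("checksum: %lu\n", sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```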
  • hm. I'm sure I saw some functions doing exactly that while scanning past ([blah]_zero_page names or similar?). I probably misunderstood something – l.k Apr 26 '21 at 07:47
  • @l.k: I don't keep up with details of Linux kernel changes; perhaps it does now keep a pool of zeroed pages, if that's what you mean. Certainly possible someone did decide to implement that, at least as an *optional* feature that might not be used by default. Dirtying 64 cache lines isn't great if you're in the middle of something else. Some workloads on some CPUs probably do benefit; it's a tradeoff. – Peter Cordes Apr 26 '21 at 07:52
  • As for the scheduling/caching costs of the zeroing, all true if you're only doing it to pre-prepare zero pages. Wouldn't pages zeroed on release to clear sensitive data be worth keeping on a pre-zeroed free list of sorts, though? – l.k Apr 26 '21 at 07:55
  • @l.k: Oh, sure, I guess if you're doing zero-on-release at all, that could make it more worth it to do the bookkeeping to avoid re-zeroing later. Normally all kernel pages are assumed to be sensitive, though, which is why the kernel tries not to leak them. (But I guess with Spectre being hard to fully mitigate, it makes sense to be more cautious these days? As well as multiple layers of defence for things that might be extra sensitive?) – Peter Cordes Apr 26 '21 at 08:01
  • userspace has sensitive data too... and the zeroes are in physical memory (obviously); unless it's low mem or something, the last address it was used at shouldn't really matter – l.k Apr 26 '21 at 08:06
  • @l.k: Yeah, but every user-space process already has to trust the kernel not to let other processes see its data while it's running or afterwards. The kernel doesn't have to zero physical pages after processes exit just for that. And AFAIK, there's no system call for user-space to ask the kernel to securely zero a page while giving up ownership of it (e.g. a flag for `munmap`). If user-space wants to zero its own memory that's fine, but the kernel can't trust user-space to have done that. So I don't see how that's relevant unless there's a (new?) system call I'm not aware of. – Peter Cordes Apr 26 '21 at 08:10
  • suid programs that crash/dump? – l.k Apr 26 '21 at 08:13
  • @l.k: Are you talking about core files? IIRC, SUID programs don't write coredumps. Or if they do, they're only readable by root. Core files are by definition files on the filesystem, at which point their data has to exist somewhere, so it can't have been zeroed. I have no idea what that would have to do with maintaining a special free-list of zeroed pages or zero-on-free of pages when they weren't in use anymore. (I thought we were still talking about that.) – Peter Cordes Apr 26 '21 at 08:16
  • they will not unless /proc/sys/fs/suid_dumpable and the like allow it; and if they don't, I was under the impression the memory was cleaned up, for much the same reason the core file isn't written. – l.k Apr 26 '21 at 08:20
  • @l.k: That doesn't make sense to me, unless you're claiming that a UID=0 process exiting normally would also have its pages zeroed (which I don't think is the case). Kernel memory always has to be assumed to hold sensitive data, which is why Linux never lets user-space access a page that hasn't been zeroed or otherwise filled with data it owns (e.g. DMA from swap space or something). Linux protects processes and users from each other by making sure never to leak values out of the kernel, not by zeroing stuff when it's returned to general kernel ownership. – Peter Cordes Apr 26 '21 at 08:26
  • true, and I may be wrong; I didn't stop to understand the zero_page functions, and I only vaguely remember the suid crash thing from a man page. I don't think it happens on normal exit either. – l.k Apr 26 '21 at 08:32