Consider a program which uses a large number of roughly page-sized memory regions (say 64 kB or so), each of which is rather short-lived. (In my particular case, these are alternate stacks for green threads.)

How would one best allocate these regions, such that their pages can be returned to the kernel once a region isn't in use anymore? The naïve solution would clearly be to simply mmap each of the regions individually, and munmap them again as soon as I'm done with them. I feel this is a bad idea, though, since there are so many of them. I suspect that the VMM may start scaling badly after a while; but even if it doesn't, I'm still interested in the theoretical case.

If I instead just mmap myself a huge anonymous mapping from which I allocate the regions on demand, is there a way to "punch holes" through that mapping for a region that I'm done with? Kind of like madvise(MADV_DONTNEED), but with the difference that the pages should be considered deleted, so that the kernel doesn't actually need to keep their contents anywhere but can just reuse zeroed pages whenever they are faulted again.
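
In code, roughly this (a sketch; POOL_SIZE, REGION_SIZE and region_offset are illustrative placeholders):

 char *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 /* ... carve 64 kB regions out of pool on demand ... */

 /* When a region is retired, give its pages back to the kernel,
  * but keep the address range reserved; in effect: */
 madvise(pool + region_offset, REGION_SIZE,
         MADV_DONTNEED /* but with "delete the pages" semantics */);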

I'm using Linux, and in this case I'm not bothered by using Linux-specific calls.

Dolda2000
  • What are "many"? Hundreds of thousands? – unwind Feb 12 '14 at 08:51
  • @unwind: Well, to be honest, I'm more interested in the theoretical case than the practical, so let's say millions. :) – Dolda2000 Feb 12 '14 at 08:52
  • I wouldn't be so sure what you mean by theoretical. Your question is a concrete engineering question that depends on a lot of factors. I wouldn't even be sure that the specific version of Linux doesn't have an impact on this kind of question. What your question basically boils down to is whether or not you should emulate a system task (memory allocation of large chunks) in user space. You should only ask yourself such a question if you are stuck on a platform where the feature in question is implemented poorly. I don't think that this is the case for Linux; the VM management is quite sophisticated. – Jens Gustedt Feb 12 '14 at 09:14
  • @JensGustedt: I meant the question of being able to punch holes through memory *in itself*, rather than whether it's useful under a certain set of premises. – Dolda2000 Feb 12 '14 at 09:15
  • @Dolda2000 But your basic assumption is that just using the APIs in the most straight-forward way would be bad, i.e. you can do it better in userspace. That's a ... not dangerous, but "weird" notion. The kernel is supposed to be *good* at that stuff. If it can't manage a million `mmap()`ed regions, it's broken and you shouldn't spend time trying to work around it in userspace. That's my feeling at least, but I'm no kernel dev. – unwind Feb 12 '14 at 09:25
  • @unwind: Not necessarily. I can imagine other circumstances than just performance that might also make it useful, such as a need to manage allocations from a contiguous block of memory manually (such as for a custom `malloc` implementation or the like). Or Sergey's scenario below, for that matter. – Dolda2000 Feb 12 '14 at 09:34

2 Answers

I did a lot of research into this topic (for a different use case) at some point. In my case I needed a large hashmap that was very sparsely populated, plus the ability to zero it every now and then.

mmap solution:

The easiest solution (and a portable one; madvise(MADV_DONTNEED) is Linux-specific) to zero pages in a mapping like this is to mmap a fresh anonymous mapping over them.

 char *mapping = mmap(NULL, total_length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
 // use the mapping

 // zero certain pages by mapping fresh zero-fill pages over them
 mmap(mapping + page_aligned_offset, length, PROT_READ | PROT_WRITE,
      MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

Performance-wise, the last call is equivalent to a subsequent munmap/mmap with MAP_FIXED, but being a single atomic operation it is also thread-safe.

Performance-wise, the problem with this solution is that the pages have to be faulted in again on subsequent write accesses, each of which incurs a page fault and a context switch. This is only efficient if very few pages were faulted in in the first place.

memset solution:

After getting such poor performance when most of the mapping had to be unmapped, I decided to zero the memory manually with memset instead. If roughly 70% or more of the pages are already faulted in (and if they are not, they are after the first round of memset), then this is faster than remapping those pages.
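
The in-place variant is just this (a sketch; `mapping` and `total_length` as in the snippet above):

 /* cheapest when most of the pages are already resident */
 memset(mapping, 0, total_length);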

mincore solution:

My next idea was to memset only those pages that had been faulted in before. This solution is NOT thread-safe. Calling mincore to determine which pages are faulted in and then selectively memsetting them to zero was a significant performance improvement, until over 50% of the mapping was faulted in, at which point memsetting the entire mapping became cheaper (mincore is a system call and requires a context switch).
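
A minimal sketch of that approach (my names, not the code I actually used; assumes `mapping` and `length` are page-aligned, error handling elided):

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void zero_resident_pages(char *mapping, size_t length)
{
        size_t ps = (size_t)sysconf(_SC_PAGESIZE);
        size_t npages = length / ps;
        unsigned char vec[npages];      /* one status byte per page; malloc for big maps */
        size_t i;

        /* NOT thread-safe: pages may be faulted in between the
         * mincore() call and the memset() below. */
        if (mincore(mapping, length, vec) == -1)
                return;

        for (i = 0; i < npages; i++)
                if (vec[i] & 1)         /* low bit set: page is resident */
                        memset(mapping + i * ps, 0, ps);
}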

incore table solution:

The final approach I then took was to keep my own in-core table (one bit per page) that says whether a page has been used since the last wipe. In each round you only zero the pages you actually used, which makes this by far the most efficient approach. It obviously is not thread-safe either, and it requires you to track in user space which pages have been written to, but if you need the performance, it is worth it.
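
A sketch of that scheme (illustrative names and sizes, not my original code):

#include <stdint.h>
#include <string.h>

#define NPAGES    1024
#define PAGE_SIZE 4096

static uint64_t used[NPAGES / 64];      /* in-core table: one bit per page */

static void mark_used(size_t page)      /* call on every write to a page */
{
        used[page / 64] |= UINT64_C(1) << (page % 64);
}

static void wipe(char *mapping)         /* zero only pages used since the last wipe */
{
        size_t i;

        for (i = 0; i < NPAGES; i++)
                if (used[i / 64] & (UINT64_C(1) << (i % 64)))
                        memset(mapping + i * PAGE_SIZE, 0, PAGE_SIZE);
        memset(used, 0, sizeof(used));
}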

Sergey L.
  • Your `mmap` solution actually seems to be precisely what I was looking for. In my particular case, speed is not all that important, since new references are comparatively rare, but I would like to save the memory that isn't in active use. It actually just didn't strike me that I could simply `mmap` over the existing mapping, and testing this while checking with `pmap` verifies that doing it that way doesn't fragment the mapping. Great answer in all other respects as well! – Dolda2000 Feb 12 '14 at 09:43

I don't see why making lots of mmap/munmap calls should be that bad. The lookup performance for mappings in the kernel should be O(log n).

As Linux seems to be implemented right now, your only option for punching holes in a mapping is mprotect(PROT_NONE), and that still fragments the mappings inside the kernel, so it's mostly equivalent to mmap/munmap, except that nothing else will be able to steal that VM range from you. What you'd really want is madvise(MADV_REMOVE), or as it's called in BSD, madvise(MADV_FREE); that is explicitly designed to do exactly what you want: the cheapest way to reclaim pages without fragmenting the mappings. But at least according to the man page on my two flavors of Linux, it's not fully implemented for all kinds of mappings.
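
For reference, punching a hole that way looks like this (a sketch; `mapping`, `page_aligned_offset` and `length` are placeholders):

 /* The range stays reserved (nothing else can map it), but becomes
  * inaccessible; the kernel's mapping still gets split, much as with
  * munmap, and the pages may or may not actually be reclaimed. */
 mprotect(mapping + page_aligned_offset, length, PROT_NONE);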

Disclaimer: I'm mostly familiar with the internals of BSD VM systems, but this should be quite similar on Linux.

As discussed in the comments below, surprisingly enough MADV_DONTNEED seems to do the trick:

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>

#include <stdio.h>
#include <unistd.h>

#include <err.h>

int
main(int argc, char **argv)
{
        int ps = getpagesize();
        struct rusage ru = {0};
        char *map;
        int n = 15;
        int i;

        if ((map = mmap(NULL, ps * n, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)) == MAP_FAILED)
                err(1, "mmap");

        for (i = 0; i < n; i++) {
                map[ps * i] = i + 10;
        }

        printf("unnecessary printf to fault stuff in: %d %ld\n", map[0], ru.ru_minflt);

        /* Unnecessary call to madvise to fault in that part of libc. */
        if (madvise(&map[ps], ps, MADV_NORMAL) == -1)
                err(1, "madvise");

        if (getrusage(RUSAGE_SELF, &ru) == -1)
                err(1, "getrusage");
        printf("after MADV_NORMAL, before touching pages: %d %ld\n", map[0], ru.ru_minflt);

        for (i = 0; i < n; i++) {
                map[ps * i] = i + 10;
        }

        if (getrusage(RUSAGE_SELF, &ru) == -1)
                err(1, "getrusage");
        printf("after MADV_NORMAL, after touching pages: %d %ld\n", map[0], ru.ru_minflt);

        if (madvise(map, ps * n, MADV_DONTNEED) == -1)
                err(1, "madvise");

        if (getrusage(RUSAGE_SELF, &ru) == -1)
                err(1, "getrusage");
        printf("after MADV_DONTNEED, before touching pages: %d %ld\n", map[0], ru.ru_minflt);

        for (i = 0; i < n; i++) {
                map[ps * i] = i + 10;
        }

        if (getrusage(RUSAGE_SELF, &ru) == -1)
                err(1, "getrusage");
        printf("after MADV_DONTNEED, after touching pages: %d %ld\n", map[0], ru.ru_minflt);

        return 0;
}

I'm measuring ru_minflt as a proxy for how many pages we needed to allocate (this isn't exactly accurate, but the next observation makes it more convincing). We can see that we get fresh pages after MADV_DONTNEED, because in the printf just after it (before touching the pages again) the contents of map[0] are 0.

Art
  • `mprotect(PROT_NONE)` doesn't seem right. That doesn't actually allow the kernel to reclaim the pages, does it? And yes, `MADV_REMOVE` is what got my hopes up, but the documentation does indeed claim that it's only implemented as a specialty by a few filesystems. – Dolda2000 Feb 12 '14 at 09:09
  • @Dolda2000 I don't know if PROT_NONE allows Linux to reclaim the pages; I know it does in a few flavors of BSD. But doing that inside a larger mmap just splits that mmap into two or three separate mappings, so it doesn't prevent fragmentation inside the kernel. It's almost like `munmap`, except that those particular addresses won't be marked as free, so they won't be available for allocation. – Art Feb 12 '14 at 09:16
  • I just recalled something. Aren't anonymous mappings in Linux actually implemented internally on top of tmpfs, in which case `madvise(MADV_REMOVE)` should work on them? I need to test this; back in a few minutes. – Art Feb 12 '14 at 09:19
  • I tried myself just now, and it doesn't seem to work, unfortunately. `madvise` returns `EINVAL` when I call it on an anonymous mapping. – Dolda2000 Feb 12 '14 at 09:28
  • I didn't get `MADV_REMOVE` to work, but surprisingly `MADV_DONTNEED` seems to do exactly what you want, which is not what I would expect for how it should be implemented. The pages I did `MADV_DONTNEED` on get zero-filled on the next read and the number of page faults (as reported by `/usr/bin/time -v`) raises as expected. – Art Feb 12 '14 at 09:33
  • Oh, what do you know. It seems I didn't read the `madvise` manpage properly. I didn't expect `MADV_DONTNEED` to actually allow the kernel to clear the pages. The more you know. – Dolda2000 Feb 12 '14 at 09:51
  • I didn't expect `DONTNEED` to do that either. This is not how it's implemented in BSD. `DONTNEED` in BSD only moves the pages to the inactive queue which means that they will be eaten by the pagedaemon and either thrown away or swapped out on the next pagedaemon sweep. Anyway, I updated the answer with the short hack I used to test this. – Art Feb 12 '14 at 09:54