Linux mremap without freeing the old mapping?

Question

I need a way to copy pages from one virtual address range to another without actually copying the data. The ranges are massive and latency is important. mremap can do this, but it also deletes the old mapping. Since I need to do this in a multithreaded environment, I need the old mapping to remain usable at the same time; I will free it later, once I'm certain no other threads can be using it. Is this possible, however hacky, without modifying the kernel? The solution only needs to work with recent Linux kernels.

Out of curiosity, why bother with remap if the memory is still being accessed at the old address? — user4815162342, Sep 19 '13 at 20:44
Is there another mechanism to map the same pages to a new, larger mapping at a different address? That would answer the question. — Eloff, Sep 19 '13 at 20:49
I don't know the answer to your question (AFAIK it's impossible), but I'm curious as to why you need that. After all, if you only need to enlarge the mapping, that should be possible without relocating it, as long as you keep adjacent mapping well-spaced, which should not be a problem unless you're running a 32-bit kernel. — user4815162342, Sep 19 '13 at 20:52
Difficult and unreliable with potentially hundreds of thousands of large mappings on x64. I'd rather modify the kernel. These mappings are actually mapped shm_open names. I think it might be possible to extend the mapping with ftruncate and then map the new larger region in an overlapping new mmap in the same process. Is that possible? — Eloff, Sep 19 '13 at 21:03
Are you mapping an actual file, or an anonymous region? I _think_ if it's a file, then you can use `mmap` the same file again and get a different set of addresses. — Mats Petersson, Sep 19 '13 at 21:56

Nominal Animal · Accepted Answer · 2015-02-04T04:34:00.680

It is possible, although there are architecture-specific cache consistency issues you may need to consider. Some architectures simply do not allow the same page to be accessed from multiple virtual addresses simultaneously without losing coherency. So, some architectures will manage this fine, others do not.

Edited to add: AMD64 Architecture Programmer's Manual vol. 2, System Programming, section 7.8.7 Changing Memory Type, states:

A physical page should not have differing cacheability types assigned to it through different virtual mappings; they should be either all of a cacheable type (WB, WT, WP) or all of a non-cacheable type (UC, WC, CD). Otherwise, this may result in a loss of cache coherency, leading to stale data and unpredictable behavior.

Thus, on AMD64, it should be safe to mmap() the same file or shared memory region again, as long as the same prot and flags are used; it should cause the kernel to use the same cacheable type to each of the mappings.

The first step is to always use a file backing for the memory maps. Use mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0) so that the mappings do not reserve swap. (If you forget this, you'll run into swap limits much sooner than you hit actual real-life limits for many workloads.) The extra overhead caused by having a file backing is absolutely negligible.

Edited to add: User strcat pointed out that current kernels do not apply address space randomization to the addresses. Fortunately, this is easy to fix, by simply supplying randomly generated addresses to mmap() instead of NULL. On x86-64, the user address space is 47-bit, and the address should be page aligned; you could use e.g. Xorshift* to generate the addresses, then mask out the unwanted bits: & 0x00007FFFFFE00000 would give 2097152-byte-aligned 47-bit addresses, for example.

Because the backing is a file, you can create a second mapping to the same file after enlarging the backing file using ftruncate(). Only after a suitable grace period, when you know no thread is using the original mapping anymore (perhaps use an atomic counter to keep track of that?), do you unmap the original mapping.
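As a rough illustration of the two-mapping approach, the sketch below maps an object, enlarges it with ftruncate(), maps it a second time, and checks that a write through the old address is visible through the new one. It uses `memfd_create()` as a stand-in for a shm_open() descriptor or a tmpfs file; that choice, the sizes, and the function names `map_enlarged` and `alias_demo` are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: grow the object behind fd to new_size and return a
 * second, larger MAP_SHARED mapping of it. The old mapping is untouched. */
void *map_enlarged(int fd, size_t new_size)
{
    assert(new_size > 0);
    if (ftruncate(fd, (off_t)new_size) == -1)
        return MAP_FAILED;
    return mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NORESERVE, fd, 0);
}

/* Returns 1 if data written via the old mapping is seen via the new one,
 * 0 if not, and -1 if the setup failed. */
int alias_demo(void)
{
    const size_t old_size = 1 << 20, new_size = 4 << 20;
    int fd = memfd_create("alias-demo", 0);  /* stand-in for shm_open() */
    if (fd == -1)
        return -1;
    if (ftruncate(fd, (off_t)old_size) == -1) {
        close(fd);
        return -1;
    }
    char *old_map = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (old_map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    char *new_map = map_enlarged(fd, new_size);
    if (new_map == MAP_FAILED) {
        munmap(old_map, old_size);
        close(fd);
        return -1;
    }
    /* Both addresses alias the same pages for the first old_size bytes. */
    strcpy(old_map, "written via old mapping");
    int same = strcmp(new_map, "written via old mapping") == 0;

    /* After the grace period, drop the old mapping; the new one stays valid. */
    munmap(old_map, old_size);
    new_map[new_size - 1] = 1;  /* the enlarged tail is usable */
    munmap(new_map, new_size);
    close(fd);
    return same;
}
```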

In practice, when a mapping needs to be enlarged, you first enlarge the backing file, then try mremap(mapping, oldsize, newsize, 0) to see if the mapping can be grown, without moving the mapping. Only if the in-place remapping fails, do you need to switch to the new mapping.
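The grow-in-place-or-switch logic might look roughly like this. `grow_mapping` and its `moved` out-parameter are hypothetical names, and the grace-period bookkeeping is left to the caller.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper. Enlarges the backing object, then tries to grow the
 * mapping in place. If that fails, creates a second mapping instead.
 * On success returns the address to use from now on; *moved is 1 when the
 * old mapping still exists at old_map and must be unmapped by the caller
 * after the grace period. Returns MAP_FAILED on error. */
void *grow_mapping(int fd, void *old_map, size_t old_size, size_t new_size,
                   int *moved)
{
    assert(new_size >= old_size);
    *moved = 0;
    if (ftruncate(fd, (off_t)new_size) == -1)
        return MAP_FAILED;

    /* flags == 0, i.e. no MREMAP_MAYMOVE: grow in place or fail. */
    void *p = mremap(old_map, old_size, new_size, 0);
    if (p != MAP_FAILED)
        return p;

    /* In-place growth failed: map the same object again, somewhere else. */
    p = mmap(NULL, new_size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (p != MAP_FAILED)
        *moved = 1;
    return p;
}
```

When `*moved` is set, readers that still hold the old address keep working, because both mappings share the same pages.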

Edited to add: You definitely do want to use mremap() instead of just using mmap() and MAP_FIXED to create a larger mapping, because mmap() unmaps (atomically) any existing mappings, including those belonging to other files or shared memory regions. With mremap(), you get an error if the enlarged mapping would overlap with existing mappings; with mmap() and MAP_FIXED, any existing mappings that the new mapping overlaps are ignored (unmapped).

Unfortunately, I must admit I haven't verified if the kernel detects collisions between existing mappings, or if it just assumes the programmer knows about such collisions -- after all, the programmer must know the address and length of every mapping, and therefore should know if the mapping would collide with another existing one. Edited to add: The 3.8 series kernels do, returning MAP_FAILED with errno==ENOMEM if the enlarged mapping would collide with existing maps. I expect all Linux kernels to behave the same way, but have no proof, aside from testing on 3.8.0-30-generic on x86_64.
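The collision behaviour can be checked with a small test along these lines. It reserves two adjacent pages, turns the second into a separate mapping by changing its protection, and then asks mremap() to grow the first page into it without MREMAP_MAYMOVE. The function name `grow_into_neighbour` is invented for the example; the ENOMEM result is what the mremap(2) man page documents for this case.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the errno from trying to grow a one-page mapping into an occupied
 * neighbouring page, 0 if the grow unexpectedly succeeded, and -1 if the
 * setup failed. */
int grow_into_neighbour(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* Reserve two adjacent pages in one go so they are guaranteed to touch. */
    char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Different protection makes the second page a distinct mapping. */
    if (mprotect(base + page, page, PROT_READ) == -1) {
        munmap(base, 2 * page);
        return -1;
    }

    /* No MREMAP_MAYMOVE: the kernel may only grow in place. */
    errno = 0;
    void *p = mremap(base, page, 2 * page, 0);
    int err = (p == MAP_FAILED) ? errno : 0;

    munmap(base, 2 * page);
    return err;
}
```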

Also note that in Linux, POSIX shared memory is implemented using a special filesystem, typically a tmpfs mounted at /dev/shm (or /run/shm with /dev/shm being a symlink). shm_open() et al. are implemented by the C library. Instead of having a large POSIX shared memory capability, I'd personally use a specially mounted tmpfs for use in a custom application. If nothing else, the security controls (users and groups able to create new "files" in there) are much easier and clearer to manage.

If the mapping is, and has to be, anonymous, you can still use mremap(mapping, oldsize, newsize, 0) to try and resize it; it just may fail.

~~Even with hundreds of thousands of mappings, the 64-bit address space is vast, and the failure case rare. So, although you must handle the failure case too, it does not necessarily have to be fast.~~ Edited to modify: On x86-64, the address space is 47-bit, and mappings must start at a page boundary (12 bits for normal pages, 21 bits for 2M hugepages, and 30 bits for 1G hugepages), so there are only 35, 26, or 17 bits available in the address space for the mappings. So, collisions are more frequent, even if random addresses are suggested. (For 2M mappings, 1024 maps had an occasional collision, but at 65536 maps, the probability of a collision (resize failure) was about 2.3%.)

Edited to add: User strcat pointed out in a comment that by default Linux mmap() will return consecutive addresses, in which case growing the mapping will always fail unless it's the last one, or a map was unmapped just there.

The approach I know works in Linux is complicated and very architecture-specific. You can remap the original mapping read-only, create a new anonymous map, and copy the old contents there. You need a SIGSEGV handler (SIGSEGV signal being raised for the particular thread that tries to write to the now read-only mapping, this being one of the few recoverable SIGSEGV situations in Linux even if POSIX disagrees) that examines the instruction that caused the problem, simulates it (modifying the contents of the new mapping instead), and then skips the problematic instruction. After a grace period, when there are no more threads accessing the old, now read-only mapping, you can tear down the mapping.

All of the nastiness is in the SIGSEGV handler, of course. Not only must it be able to decode all machine instructions and simulate them (or at least those that write to memory), but it must also busy-wait if the new mapping has not been completely copied yet. It is complicated, absolutely unportable, and very architecture-specific, but possible.

I'm having trouble googling x64 virtual address aliasing and whether it will work or not, do you know? The problem is I've got up to 1000 processes, each of which use ridiculous amounts of virtual address space, on a box with 256GB of ram and only a piddling 512GB of disk, half of which is earmarked for other things. So file backed is not possible, but afaik the shm_open handles work like nameable, shareable mappings. Since they're nameable you can try to map the same one in the same process (about to test what the kernel does with that.) — Eloff, Sep 19 '13 at 23:25
Failing that I much prefer changing mremap in the kernel (adding an extra flag to keep the old mapping.) I'd imagine that's relatively easy and may even get accepted upstream, but then my imagination doesn't always intersect with reality. — Eloff, Sep 19 '13 at 23:25
mremap is not needed at all; the shm_open fd can be mapped, extended with ftruncate, and then mapped again in a new overlapping mapping. You get two separate virtual addresses, and writing to one and reading from the other works just fine. Please update your answer to make this clear for anyone who finds this answer later, and I will accept it. — Eloff, Sep 19 '13 at 23:53
@Eloff, wrt. address aliasing and cache coherency: According to AMD64 Architecture Programmer's Manual vol. 2, section 7.8.7, multiple virtual mappings to the same pages will work fine as long as they are all either cacheable (writeback, writethrough, or write-protected), or noncacheable (uncached, write-combining, or caching disabled); not a mix. — Nominal Animal, Sep 20 '13 at 00:16
@Eloff: Each process has its own 64-bit virtual address space. Each map will likely reside at different virtual addresses for each process. In Linux, POSIX shared memory (`shm_open()` et al.) is implemented via `/dev/shm` or `/run/shm`, typically a tmpfs filesystem. If I were you, I'd use a dedicated tmpfs for the file "backing", not POSIX shared memory. The mappings use the page cache pages, so there is no duplication anyway. — Nominal Animal, Sep 20 '13 at 00:38
@Eloff, wrt. no `mremap()`: `mmap()` with `MAP_FIXED` flag will happily overwrite *ALL* existing maps, even those belonging to other files (or shared memory regions, which are basically the same thing in Linux). So, if you use `mmap(addr,newsize,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_NORESERVE|MAP_FIXED,fd,0)` to "grow" the region, you won't get an error if the new region would overlap with another mapping; it will just succeed. On the other hand, with `mremap()`, you get an error if the enlarged mapping collides with another mapping. So, you do need to use `mremap()`. — Nominal Animal, Sep 20 '13 at 00:49
@Animal I was thinking of specifying mmap(NULL, RW, MAP_SHARED | MAP_NORESERVE, fd, 0), which will give you a new virtual address that shares pages with the previous mapping. When done with the old address, it can safely be unmapped without affecting the new mapping. Btw, what's wrong with POSIX shared memory vs. a dedicated tmpfs? It should be the same thing on Linux. — Eloff, Oct 03 '13 at 00:04
@Eloff: Right, seems fine to me. There is nothing wrong with POSIX shared memory per se, it's just that since it is implemented via tmpfs on Linux, dedicating a separate tmpfs for your application gives the sysadmin more detailed control. See the `tmpfs` entries in `/etc/fstab`, `size` attribute for example. For app-specific tmpfs, you can set also `uid`, `gid`, `mode`. This way only specific users can utilize the tmpfs at all. When your service runs as a specific user, you can easily control the resources dedicated to it. As a sysadmin, I find such controls very useful, that's all. — Nominal Animal, Oct 03 '13 at 00:51
The kernel picks a base address and then creates mappings in the gap at the lowest address. Using `mremap` without `MREMAP_MAYMOVE` will *almost always fail* because everything is tightly packed and growing up will hit the old mappings. It can only succeed when another mapping was unmapped there, fragmenting the virtual address space. The useful thing about `mremap` is that it can perform moves via copying the page tables rather than copying data, which is *much* faster. — strcat, Feb 03 '15 at 21:56
@strcat: And *that* is the reason you downvoted my answer? Did you even read the original question? You pointed out a detail that is easily fixed, thanks for that. I'll amend the answer to reflect that. — Nominal Animal, Feb 04 '15 at 04:15
ASLR does apply to `mmap`: it randomizes the base. Spreading out mappings over the address space will cause a significant performance hit along with pathologically fragmenting the address space into ever smaller gaps which could easily lead to OOM even on 64-bit when a large mapping is requested. — strcat, Feb 04 '15 at 11:48
@strcat: Don't be stupid, please, and read the original question. You downvoted my answer because you found an irrelevant detail in it objectionable, instead of anything relevant to it as an answer to the posed question. Virtual address space fragmentation is the least worry the OP has. For 4k pages, your "significant performance hit" is roughly 10% worst-case (measured on x86-64 using 1024 maps, full 47-bit address space randomization). For huge pages, the performance hit is smaller (as the page tables are much smaller). — Nominal Animal, Feb 05 '15 at 07:21
I'm well aware of the original question. I didn't downvote your answer because of one technical error. I just don't feel like elaborating on what I think is wrong with it. — strcat, Feb 05 '15 at 11:31
I don't know how you're measuring the performance hit, but it is certainly bigger than 10% for many use cases, especially for a memory-bound workload, which is probably the case. Virtual memory fragmentation is hard to dismiss when they're stating that they have *many* very large mappings. Randomly spraying the address space with allocations is going to rapidly bring down the size of the largest gap. — strcat, Feb 05 '15 at 11:34
@strcat: I measured the time (both CLOCK_MONOTONIC and CLOCK_PROCESS_CPUTIME_ID clocks) for filling the first page of each mapping a few times using `memset()` after all mappings are established, including the initial page hit, and compared the times for 1024 mappings, 2M each, spread around the 47-bit address space. No large latencies were detected, and the variance in timing was in the 4-8% range. — Nominal Animal, Feb 05 '15 at 17:17
@strcat: Instead of just downvoting because of your "feelings", perhaps you should instead consider offering your own answer? Frankly, I see downvoting an answer if you are incapable of pointing out the issues as either useless or petty. Have I pissed you off somehow? The only reason I care is that I care about the *quality* of my answers -- I never vote on other answers myself --, and am always ready to admit to my error, and try to fix it. I kinda liked having only two downvotes among my 200 answers thus far. Yours is the third, and the first that I cannot understand at all. — Nominal Animal, Feb 05 '15 at 17:26

stsp · Answer 2 · 2016-07-23T12:57:55.803

4

Yes, you can do this.

mremap(old_address, old_size, new_size, flags) deletes the old mapping only of the size "old_size". So if you pass 0 as "old_size", it will not unmap anything at all.

Caution: this works as expected only with shared mappings, so this kind of mremap() call should be used on a region previously mapped with MAP_SHARED. That is the only requirement: you don't even need a file-backed mapping, and can use the "MAP_SHARED | MAP_ANONYMOUS" combination for the mmap() flags. Some very old OSes may not support "MAP_SHARED | MAP_ANONYMOUS", but on Linux you are safe.

If you try that on a MAP_PRIVATE region, the result would be roughly similar to memcpy(), i.e. no memory alias will be created, but it will still use the CoW machinery. It is not clear from your initial question whether you need an alias or whether a CoW copy is fine too.

UPDATE: for this to work, you also need to specify the MREMAP_MAYMOVE flag obviously.
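A minimal sketch of this trick, assuming a shared anonymous mapping and an alias of the same size as the original. The function name `zero_size_alias_demo` and the 1 MiB size are chosen for the example only.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

/* Returns 1 if mremap() with old_size == 0 produced a second address that
 * aliases the same pages, 0 if it did not, and -1 on failure. */
int zero_size_alias_demo(void)
{
    const size_t size = 1 << 20;
    char *old_map = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (old_map == MAP_FAILED)
        return -1;
    strcpy(old_map, "before");

    /* old_size == 0 on a shared mapping: nothing is unmapped, and a new
     * mapping of the same pages is created. MREMAP_MAYMOVE is required. */
    char *alias = mremap(old_map, 0, size, MREMAP_MAYMOVE);
    if (alias == MAP_FAILED) {
        munmap(old_map, size);
        return -1;
    }

    /* Writes through either address are visible through the other. */
    int ok = alias != old_map && strcmp(alias, "before") == 0;
    strcpy(alias, "after");
    ok = ok && strcmp(old_map, "after") == 0;

    munmap(old_map, size);
    munmap(alias, size);
    return ok;
}
```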

answered Mar 02 '16 at 11:34

“if you pass 0 as "old_size", it will not unmap anything at all” — that's technically correct, but it also means that no virtual address will be remapped to begin with, thus making the call to `mremap` useless and leaving OP's problem unsolved. Pages are indeed moved from [`old_address`, `old_address + min(old_size, new_size)`] to [`new_address`, `new_address + min(old_size, new_size)`]. – Arkanosis Jul 22 '16 at 10:29
Please check your facts. What you say is simply wrong and any small test-case can confirm this. – stsp Jul 23 '16 at 12:53
It is also a bit unclear what the OP really wants: does he need a memory alias, or does he need a CoW copy of the original region? – stsp Jul 23 '16 at 13:01
This (being able to remap MAP_PRIVATE) is a good feature but seems to be no longer supported because it is regarded as a bug: http://man7.org/linux/man-pages/man2/mremap.2.html – Determinant Dec 10 '19 at 17:43

score 3 · Answer 3 · answered Apr 22 '20 at 23:04

This was added in the 5.7 kernel as a new flag to mremap(2) called MREMAP_DONTUNMAP. This leaves the existing mapping in place after moving the page table entries.
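A hedged sketch of how the flag might be used on a private anonymous mapping. The fallback `#define` covers C library headers that predate the flag, the old and new sizes are kept equal, and a NULL address hint is passed explicitly; the function name `dontunmap_demo` is made up. On kernels older than 5.7 the call is expected to fail, which the sketch reports as 0 rather than as an error.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4  /* value from linux/mman.h, for older libc headers */
#endif

/* Returns 1 if the pages moved and the old range stayed mapped, 0 if the
 * kernel rejected the call (e.g. older than 5.7) or the check did not hold,
 * and -1 if the initial mmap() failed. */
int dontunmap_demo(void)
{
    const size_t size = 1 << 20;
    char *old_map = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old_map == MAP_FAILED)
        return -1;
    strcpy(old_map, "payload");

    /* The fifth argument is only an address hint here; NULL lets the
     * kernel choose where the pages go. */
    char *new_map = mremap(old_map, size, size,
                           MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
    if (new_map == MAP_FAILED) {
        munmap(old_map, size);
        return 0;
    }

    /* The data now lives at new_map. The old range is still mapped, but
     * its pages were moved away, so it reads back as fresh zero pages. */
    int ok = strcmp(new_map, "payload") == 0 && old_map[0] == 0;

    munmap(old_map, size);
    munmap(new_map, size);
    return ok;
}
```

Note the difference from the two-mapping approaches in the other answers: the old range does not alias the moved pages, it just remains a valid (empty) mapping.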

See https://github.com/torvalds/linux/commit/e346b3813067d4b17383f975f197a9aa28a3b077#diff-14bbdb979be70309bb5e7818efccacc8

Brian G

What is the advantage over setting 0 as old_size to mremap? – stsp Jun 11 '20 at 17:48