16

Our software builds an in-memory data structure that is about 80 gigabytes in size. It can then either use this data structure directly for its computation, or dump it to disk so it can be reused several times afterwards. The computation performs a lot of random memory accesses in this data structure.

For larger inputs this data structure can grow even bigger (our largest one was over 300 gigabytes), and our servers have enough memory to hold everything in RAM.

If the data structure is dumped to disk, it gets loaded back into the address space with mmap, forced into the OS page cache, and finally mlocked (code at the end).

The problem is that there is about a 16% performance difference between using the computed data structure directly on the heap (see the malloc version) and mmapping the dumped file (see the mmap version). I don't have a good explanation for why this is the case. Is there a way to find out why mmap is so much slower? Can I close this performance gap somehow?

I did the measurements on a server running Scientific Linux 7.2 with a 3.10 kernel; it has 128 GB of RAM (enough to fit everything). I repeated them several times with similar results. Sometimes the gap is a bit smaller, but not by much.

Update (2017/05/23):

I produced a minimal test case where the effect can be seen. I tried the different flags (MAP_SHARED etc.) without success; the mmap version is still slower.

#include <random>
#include <iostream>
#include <cstdio>    // fopen, fread, fwrite, perror
#include <cstdlib>   // malloc, exit
#include <sys/time.h>
#include <ctime>
#include <omp.h>
#include <sys/mman.h>
#include <unistd.h>

constexpr size_t ipow(int base, int exponent) {
    size_t res = 1;
    for (int i = 0; i < exponent; i++) {
        res = res * base;
    }
    return res;
}

// Wall-clock time in milliseconds.
size_t getTime() {
    struct timeval tv;

    gettimeofday(&tv, NULL);
    size_t ret = tv.tv_usec;
    ret /= 1000;
    ret += (tv.tv_sec * 1000);

    return ret;
}

const size_t N = 1000000000;          // 1 GB data array
const size_t tableSize = ipow(21, 6); // 21^6 = 85,766,121 offsets

size_t* getOffset(std::mt19937 &generator) {
    std::uniform_int_distribution<size_t> distribution(0, N - 1); // bounds are inclusive; N would index past the end of data
    std::cout << "Offset Array" << std::endl;
    size_t r1 = getTime();
    size_t *offset = (size_t*) malloc(sizeof(size_t) * tableSize);
    for (size_t i = 0; i < tableSize; ++i) {
        offset[i] = distribution(generator);
    }
    size_t r2 = getTime();
    std::cout << (r2 - r1) << std::endl;

    return offset;
}

char* getData(std::mt19937 &generator) {
    std::uniform_int_distribution<int> datadist(1, 10); // char is not a valid IntType for uniform_int_distribution
    std::cout << "Data Array" << std::endl;
    size_t o1 = getTime();
    char *data = (char*) malloc(sizeof(char) * N);
    for (size_t i = 0; i < N; ++i) {
        data[i] = static_cast<char>(datadist(generator));
    }
    size_t o2 = getTime();
    std::cout << (o2 - o1) << std::endl;

    return data;
}

template<typename T>
void dump(const char* filename, T* data, size_t count) {
    FILE *file = fopen(filename, "wb");
    fwrite(data, sizeof(T), count, file); 
    fclose(file);
}

template<typename T>
T* read(const char* filename, size_t count) {
#ifdef MMAP
    FILE *file = fopen(filename, "rb");
    int fd = fileno(file);
    T *data = (T*) mmap(NULL, sizeof(T) * count, PROT_READ, MAP_SHARED | MAP_NORESERVE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); exit(1); }
    fclose(file); // the mapping stays valid after the stream is closed
    // Touch one byte per page to fault the whole file into the page cache,
    // then pin it so it cannot be evicted.
    size_t pageSize = sysconf(_SC_PAGE_SIZE);
    char bytes = 0;
    for (size_t i = 0; i < (sizeof(T) * count); i += pageSize) {
        bytes ^= ((char*)data)[i];
    }
    mlock((char*)data, sizeof(T) * count);
    std::cout << bytes; // keep the touch loop from being optimized away
#else
    T* data = (T*) malloc(sizeof(T) * count);
    FILE *file = fopen(filename, "rb");
    fread(data, sizeof(T), count, file); 
    fclose(file);
#endif
    return data;
}

int main (int argc, char** argv) {
#ifdef DATAGEN
    std::mt19937 generator(42);
    size_t *offset = getOffset(generator);
    dump<size_t>("offset.bin", offset, tableSize);

    char* data = getData(generator);
    dump<char>("data.bin", data, N);
#else
    size_t *offset = read<size_t>("offset.bin", tableSize); 
    char *data = read<char>("data.bin", N); 
    #ifdef MADV
        posix_madvise(offset, sizeof(size_t) * tableSize, POSIX_MADV_SEQUENTIAL);
        posix_madvise(data, sizeof(char) * N, POSIX_MADV_RANDOM);
    #endif
#endif

    const size_t R = 10; 
    std::cout << "Computing" << std::endl;
    size_t t1 = getTime();
    size_t result = 0;
#pragma omp parallel reduction(+:result)
    {
        size_t magic = 0;
        for (size_t r = 0; r < R; ++r) {
#pragma omp for schedule(dynamic, 1000)
            for (size_t i = 0; i < tableSize; ++i) {
                char val = data[offset[i]]; // random gather from the 1 GB data array
                magic += val;
            }
        }
        result += magic;
    }
    size_t t2 = getTime();

    std::cout << result << "\t" << (t2 - t1) << std::endl;
}

Please excuse the C++; its <random> facilities are easier to use. I compiled it like this:

# The version that writes out the .bin files and also computes on the heap
g++ bench.cpp -fopenmp -std=c++14 -O3 -march=native -mtune=native -DDATAGEN -o datagen
# The mmap version
g++ bench.cpp -fopenmp -std=c++14 -O3 -march=native -mtune=native -DMMAP -o mmap
# The fread/heap version
g++ bench.cpp -fopenmp -std=c++14 -O3 -march=native -mtune=native -o fread
# For madvise, add -DMADV

On this server I get the following times (I ran all of the commands a few times):

./mmap
2030 ms

./fread
1350 ms

./mmap+madv
2030 ms

./fread+madv
1350 ms

numactl --cpunodebind=0 ./mmap
2600 ms

numactl --cpunodebind=0 ./fread
1500 ms
Brutos
  • 4
    As I see it, fillWithData reads the whole file in one giant step. mmap, on the other hand, reads the file piece by piece, wherever you access it. This may cause the performance difference. To be more realistic, rerun the benchmark including the write-to-disk-at-the-end portion... – Malkocoglu May 16 '17 at 12:27
  • The write to disk is actually a separate program invocation; no computation happens afterwards if the data structure is dumped. We also force all the mmap memory into the page cache by running something like this: size_t pageSize = Util::getPageSize(); unsigned char bytes = 0; for(size_t i = 0; i < mappedSize; i+=pageSize){ bytes ^= mapped[i]; } magicBytes = bytes; – Brutos May 16 '17 at 12:29
  • Have you tried measuring several times in the same run, to make sure, everything is nicely cached? – Erki Aring May 16 '17 at 12:31
  • I added it as a comment -> // ... touch the first byte of every page to force it into the page cache, i'll replace it with code – Brutos May 16 '17 at 12:32
  • If you use an anonymous mmap instead of malloc, is the performance the same as malloc? I realize that doesn't actually help answer the question that much, but if the answer is that anonymous mmap is slower than malloc, that would be really interesting (and improbable) – Sam Hartman May 16 '17 at 12:38
  • I can try that, but the benchmarks will take quite a while to run (the small one takes an hour, the normal one takes about 13h). – Brutos May 16 '17 at 12:41
  • 6
    Are you updating the data you `mmap()` in? If so, the first time you update the data you force the in-memory copy of data to have its backing store changed from the file its mapped from to anonymous memory backed by swap. This mapping change will take time. Memory obtained from `malloc()` will not have to have its backing store swapped upon modification. `malloc()` may also be implemented using larger page sizes. `mmap()` is not a panacea, it has significant performance issues when used in some ways. Read [this from one Linus Torvalds.](http://marc.info/?l=linux-kernel&m=95496636207616&w=2) – Andrew Henle May 16 '17 at 12:55
  • No, the data is only read (PROT_READ). – Brutos May 16 '17 at 13:00
  • 1
    @Brutos What file system? You can try using various combinations of larger page sizes with one of the `MAP_HUGETLB`, `MAP_HUGE_2MB` or `MAP_HUGE_1GB` `mmap()` flags. If you're accessing the data randomly, you may be seeing a performance hit from TLB misses, which the larger page size should fix. I'd also check if your `malloc()` makes use of larger page sizes. – Andrew Henle May 16 '17 at 13:12
  • Have you considered *not* forcing the whole file into memory or locking it there? Paging in the data on demand could be a win, especially if the actual computation doesn't touch all the pages, or if the access pattern has relatively long runs of sequential-ish access. – John Bollinger May 16 '17 at 13:29
  • 2
    `madvise(MADV_RANDOM)` may help. – zwol May 16 '17 at 13:48
  • @AndrewHenle, I am still very confused about how HUGETLB works. I tried running the malloc version yesterday with hugectl, but didn't see any performance differences. From what I gather from the documentation, mmap and MAP_HUGETLB only work on MAP_ANON, not on file-backed mmaps. I would have to use hugetlbfs for that, I think. – Brutos May 16 '17 at 13:51
  • @JohnBollinger, pretty much every page is used, when we tested it we saw pretty bad performance drops without touching the pages into memory. – Brutos May 16 '17 at 13:52
  • 2
    Can you please profile both versions with perf, so we at least get some hints... – Andriy Berestovskyy May 16 '17 at 13:55
  • @Brutos That's possible - my huge page experience on Linux is limited. You can try parallelizing your loop: `for (size_t i = 0; i < mappedSize; i+=pageSize){ bytes ^= mapped[i]; }`. You may be running into some [NUMA effects](https://en.wikipedia.org/wiki/Non-uniform_memory_access) - `malloc()` might spread the memory over all NUMA nodes, while the memory layout from your `mmap()` might be nonuniformly distributed, causing access from non-local CPUs to suffer. Parallelizing the loop that causes the mappings to be instantiated with `#pragma omp parallel...` might help if you are seeing NUMA issues. – Andrew Henle May 16 '17 at 13:56
  • 1
    As an aside, `mlock()` ought to be sufficient to page in the whole locked range; you should not need to touch all the individual pages first. – John Bollinger May 16 '17 at 14:20
  • Is this a multi-socket system? Maybe this has something to do with how data is distributed across NUMA nodes (`numactl(8)`). – gudok May 16 '17 at 14:49
  • Mapping a large file read-only (`PROT_READ`) should use `MAP_SHARED | MAP_NORESERVE` flags, so that swap will never be involved, and the page cache is used directly. You can just add `| MAP_LOCKED` instead of calling `mlock()`, and specify `| MAP_POPULATE` to load all the pages into memory at map time. – Nominal Animal May 16 '17 at 15:02
  • `16% difference in performance between just using the computed data structure immediately on the heap (see Malloc version), or mmaping the dumped file` The malloc only has to 1) mmap() /dev/null and 2) find some physical memory when referenced. The from-disk version has to 1) find some phys memory when referenced and 2) **read the contents from disk** – wildplasser May 17 '17 at 08:58
  • 1
    @wildplasser The performance difference appears to be measured after all the data is loaded and is being processed. – Andrew Henle May 17 '17 at 14:01
  • Have you checked The VSIZ and the RSS of both processes at the end of their computations? Could be there are some holes in the memory space of the heap-based thing. – wildplasser May 17 '17 at 14:35
  • You could also use getrusage(), especially the min/maj pagefaults. Or monitor the processes using vmstat. – wildplasser May 17 '17 at 22:18
  • It does not seem to be a NUMA problem (see numactl --cpunodebind=0 ./mmap and ./fread). We also checked the mmap flags proposed by @NominalAnimal; they did not improve the performance. The madvise call did not improve the performance either. We have now produced a minimal example to reproduce the behaviour. – martin s May 23 '17 at 17:59
  • @martins: Could you show `/proc/self/smaps` (the paragraphs related to the regions used for the data) in the `malloc()` and `mmap()` cases? (You'd have to add a function to your test program that dumps `/proc/self/smaps` to e.g. a file at the end of the run; a minimal sketch of such a helper follows this comment thread.) It would tell us the differences in the two cases. – Nominal Animal May 24 '17 at 10:34
  • Is there any reason to tag this `c` rather than the `c++` it is? – EOF May 26 '17 at 09:35
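
A minimal sketch of such a dump helper, as suggested by Nominal Animal above (the function name and output filename are illustrative, not part of the original program):

#include <fstream>

// Copy /proc/self/smaps to a file so the regions backing the data arrays
// (KernelPageSize, AnonHugePages, Locked, ...) can be compared offline.
void dumpSmaps(const char* outname) {
    std::ifstream in("/proc/self/smaps");
    std::ofstream out(outname);
    out << in.rdbuf();
}

Calling e.g. dumpSmaps("smaps_mmap.txt") right after the timing loop would capture the mappings of the mmap build; the same call in the fread build gives the malloc() side for comparison.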

2 Answers

14

The malloc() back end can make use of THP (Transparent Huge Pages), which is not possible when using mmap() backed by a regular file.

Using huge pages (even transparently) can drastically reduce the number of TLB misses while running your application.

An interesting test would be to disable transparent huge pages and run your malloc() test again:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

You could also measure TLB misses using perf:

perf stat -e dTLB-load-misses,iTLB-load-misses ./command

For more info on THP, see: https://www.kernel.org/doc/Documentation/vm/transhuge.txt

People have been waiting a long time for a page cache that is huge-page capable, allowing files to be mapped using huge pages (or a mix of huge pages and standard 4K pages). There are a bunch of articles on LWN about the transparent huge page cache, but it has not reached the mainline kernel yet.

Transparent huge pages in the page cache (May 2016): https://lwn.net/Articles/686690

There is also a presentation from January this year about the future of Linux page cache: https://youtube.com/watch?v=xxWaa-lPR-8

Additionally, you can avoid all those calls to mlock on individual pages in your mmap() implementation by using the MAP_LOCKED flag. If you are not privileged, this may require adjusting the memlock limit.
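
A minimal sketch of what the MMAP branch of read() could look like with that flag (MAP_POPULATE, suggested in the question's comments, is added here as an assumption to also prefault the pages at map time):

// Lock and prefault the mapping at map time instead of touching every
// page by hand and calling mlock() afterwards. May require raising
// RLIMIT_MEMLOCK (ulimit -l) for unprivileged users.
T *data = (T*) mmap(NULL, sizeof(T) * count, PROT_READ,
                    MAP_SHARED | MAP_NORESERVE | MAP_LOCKED | MAP_POPULATE,
                    fd, 0);
if (data == MAP_FAILED) { perror("mmap"); exit(1); }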

Morian
  • Thanks a lot for this idea. I spent quite some time trying it the other way around (trying to create the .bin files on a mount with hugetlbfs and measuring that), but I couldn't get it to work. Your idea is much simpler! Now I am getting nearly the same run time numbers; with THP disabled, mmap is still a little bit slower, but not drastically. Do you have any links with more information on huge page page caches? I have other machines where I could try an experimental kernel. – Brutos May 24 '17 at 11:44
  • 2
    Transparent huge pages in the page cache (May 2016): https://lwn.net/Articles/686690/ Seems like there was also a presentation in Australia about the future of the Linux page cache in January this year: https://www.youtube.com/watch?v=xxWaa-lPR-8 But so far nothing to solve your problem. – Morian May 24 '17 at 18:40
  • @Brutos: There is a patch set by K. Shutemov (at Intel) to [add huge page support for ext4 backed files](http://marc.info/?l=linux-kernel&m=148543199403944), patchset v6 submitted in January 2017. This might allow you to use `MAP_SHARED | MAP_NORESERVE | MAP_HUGETLB | MAP_HUGE_1GB`, assuming you booted the machine with proper `hugepages=` and `hugepagesz=` arguments to pre-reserve those, and mount the ext4 filesystem using `huge=always`. – Nominal Animal May 27 '17 at 02:02
  • Too bad that there doesn't seem to be a good solution right now. I'll see if I can boot up a custom-compiled kernel soon, and whether this issue solves itself in mainstream OS kernels in about 5-7 years' time. – Brutos May 27 '17 at 16:06
  • @Morian Thanks a lot for your detailed answer. – martin s May 30 '17 at 04:38
0

I might be wrong, but...

It seems to me that the issue isn't with mmap, but with the fact that the code maps the memory to a file.

The Linux malloc falls back to mmap for large allocations, so both memory allocation flavors essentially use the same back end (mmap)... however, the difference is that malloc's mmap is anonymous, rather than mapping a specific file on the hard drive.
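
As a side note, a minimal sketch of that fallback behavior (the 128 KiB threshold is glibc's documented default; the 1 GiB allocation is just an illustrative size):

#include <malloc.h>
#include <cstdlib>

int main() {
    // glibc serves allocations above M_MMAP_THRESHOLD (128 KiB by default)
    // with an anonymous mmap() instead of growing the heap.
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    void *big = malloc(1UL << 30); // 1 GiB: far above the threshold, so mmap-backed
    free(big);
}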

The syncing of the memory information to the disk might be what's causing the "slower" performance. It's similar to saving the file almost constantly.

You might consider testing mmap without the file, by using the MAP_ANONYMOUS flag (and fd == -1 on some systems) to test for any difference.
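
A minimal sketch of such a test (assuming the dump is then read into the anonymous mapping with fread, as in the heap version):

// Anonymous mapping instead of malloc(): same mmap() back end, no file backing.
T* data = (T*) mmap(NULL, sizeof(T) * count, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (data == MAP_FAILED) { perror("mmap"); exit(1); }
FILE *file = fopen(filename, "rb");
fread(data, sizeof(T), count, file);
fclose(file);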

On the other hand, I'm not sure the "slower" memory access isn't actually faster in the long run - would you stall the whole thing to save 300 GB to disk? How long would that take? ...

... the fact that you're doing it automatically in small increments might be a performance gain rather than a penalty.

Myst