While trying to use memory-mapped files to create a multi-gigabyte file (around 13 GB), I ran into what appears to be a problem with mmap(). The initial implementation was done in C++ on Windows using boost::iostreams::mapped_file_sink, and all was well. The code was then run on Linux, and what took minutes on Windows took hours on Linux.
The two machines are clones of the same hardware: Dell R510, 2.4 GHz, 8 MB cache, 16 GB RAM, 1 TB disk, PERC H200 controller.
The Linux machine runs Oracle Enterprise Linux 6.5 with the 3.8 kernel and g++ 4.8.3.
There was some concern that there might be a problem with the Boost library, so implementations were also done with boost::interprocess::file_mapping and with native mmap(); minimal sketches of these two variants follow the C++ code below. All three show the same behavior: Windows and Linux performance is on par up to a certain point, after which the Linux performance falls off badly.
Full source code and performance numbers are linked below.
// C++ code using boost::iostreams
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <cstdint>
#include <string>

void DoMapping(uint64_t* dest, size_t rowCount); // defined below

void IostreamsMapping(size_t rowCount)
{
    std::string outputFileName = "IoStreamsMapping.out";
    boost::iostreams::mapped_file_params params(outputFileName);
    params.new_file_size = static_cast<boost::iostreams::stream_offset>(sizeof(uint64_t) * rowCount);
    boost::iostreams::mapped_file_sink fileSink(params); // NOTE: this form of the constructor creates and sizes the file.
    uint64_t* dest = reinterpret_cast<uint64_t*>(fileSink.data());
    DoMapping(dest, rowCount);
}
void DoMapping(uint64_t* dest, size_t rowCount)
{
    // inputStream is a std::istream* over the binary input file of
    // interleaved (index, value) uint32 pairs, opened elsewhere in the full source.
    inputStream->seekg(0, std::ios::beg);
    uint32_t index, value;
    for (size_t i = 0; i < rowCount; ++i)
    {
        inputStream->read(reinterpret_cast<char*>(&index), static_cast<std::streamsize>(sizeof(uint32_t)));
        inputStream->read(reinterpret_cast<char*>(&value), static_cast<std::streamsize>(sizeof(uint32_t)));
        dest[index] = value; // random-access write into the mapped output
    }
}
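For comparison, here are minimal sketches of roughly what the two alternate implementations mentioned above look like. The exact code is in the linked source, so the file names and error handling here are illustrative; both variants hand the mapped pointer to the same DoMapping helper.

// C++ sketch using boost::interprocess (illustrative)
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstdint>
#include <fstream>

void InterprocessMapping(size_t rowCount)
{
    const std::size_t bytes = sizeof(uint64_t) * rowCount;
    // file_mapping does not create or size the file, so do that first.
    {
        std::ofstream f("InterprocessMapping.out", std::ios::binary);
        f.seekp(static_cast<std::streamoff>(bytes - 1));
        f.put('\0');
    }
    boost::interprocess::file_mapping mapping("InterprocessMapping.out",
                                              boost::interprocess::read_write);
    boost::interprocess::mapped_region region(mapping, boost::interprocess::read_write);
    DoMapping(static_cast<uint64_t*>(region.get_address()), rowCount);
}

// C++ sketch using native mmap() (POSIX side; illustrative)
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

void NativeMmapMapping(size_t rowCount)
{
    const size_t bytes = sizeof(uint64_t) * rowCount;
    int fd = open("NativeMapping.out", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); exit(1); }
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) { perror("ftruncate"); exit(1); }
    void* addr = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); exit(1); }
    DoMapping(static_cast<uint64_t*>(addr), rowCount);
    munmap(addr, bytes);
    close(fd);
}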
One final test was done in Python to reproduce the behavior in another language. The fall-off happened at the same place, so it looks like the same problem.
# Python code using numpy
import numpy as np

# inputFile, outputFile and count are defined elsewhere in the full source;
# the input holds interleaved (index, value) uint32 pairs, as in the C++ test.
fpr = np.memmap(inputFile, dtype='uint32', mode='r', shape=(count*2,))
out = np.memmap(outputFile, dtype='uint64', mode='w+', shape=(count,))
print("writing output")
out[fpr[::2]] = fpr[1::2]  # even entries are indices, odd entries are values
For the C++ tests, Windows and Linux have similar performance up to around 300 million int64s (with Linux looking slightly faster). Performance falls off on Linux around 3 GB (400 million * 8 bytes per int64 = 3.2 GB) for both C++ and Python.
I know that on 32-bit Linux 3 GB is a magic boundary (the default user/kernel address-space split), but I am unaware of similar behavior on 64-bit Linux.
The gist of the results: 1.4 minutes on Windows becomes 1.7 hours on Linux at 400 million int64s. I am actually trying to map close to 1.3 billion int64s.
Can anyone explain why there is such a disconnect in performance between Windows and Linux?
Any help or suggestions would be greatly appreciated!
Updated Results (with updated Python code; Python speed is now comparable with C++)
Original Results (NOTE: the Python results are stale)