This looks like a long question because of all the context. There are 2 questions inside the novel below. Thank you for taking the time to read this and provide assistance.

Situation

I am working on a scalable datastore implementation that can support working with data files from a few KB to a TB or more in size on a 32-bit or 64-bit system.

The datastore utilizes a Copy-on-Write design: always appending new or modified data to the end of the data file and never doing in-place edits to existing data.

The system can host one or more databases, each represented by a file on disk.

The details of the implementation are not important; the only important detail is that I need to constantly append to the file and grow it from KB to MB to GB to TB, while at the same time randomly skipping around the file for read operations to answer client requests.

First Thoughts

At first glance I knew I wanted to use memory-mapped files so I could push the burden of efficiently managing the in-memory state of the data onto the host OS and out of my code.

Then all my code needs to worry about is serializing the append-to-file operations on-write, and allowing any number of simultaneous readers to seek in the file to answer requests.

Design

Because the individual data-files can grow beyond the 2GB limit of a MappedByteBuffer, I expect that my design will have to include an abstraction layer that takes a write offset and converts it into an offset inside of a specific 2GB segment.
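
To make that concrete, something along these lines is roughly what I have in mind (just a sketch; the class name, the 1GB segment size, and the lazy mapping are my own placeholder choices, not a final implementation):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

// Rough sketch: translate a global file offset into
// (segment index, offset within segment) over fixed-size mappings.
class SegmentedFile {
    private static final long SEGMENT_SIZE = 1L << 30; // 1 GB per mapping (placeholder)
    private final FileChannel channel;
    private final List<MappedByteBuffer> segments = new ArrayList<>();

    SegmentedFile(FileChannel channel) {
        this.channel = channel;
    }

    // Lazily map the segment containing the given offset. Note that mapping
    // past the current end of the file with READ_WRITE grows the file.
    private MappedByteBuffer segmentFor(long offset) throws IOException {
        int index = (int) (offset / SEGMENT_SIZE);
        while (segments.size() <= index) {
            long start = (long) segments.size() * SEGMENT_SIZE;
            segments.add(channel.map(FileChannel.MapMode.READ_WRITE, start, SEGMENT_SIZE));
        }
        return segments.get(index);
    }

    byte readByte(long offset) throws IOException {
        return segmentFor(offset).get((int) (offset % SEGMENT_SIZE));
    }
}
```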

So far so good...

Problems

This is where I started to get hung up and think that going with a different design (proposed below) might be the better way to do this.

From reading through 20 or so "memory mapped" related questions here on SO, it seems mmap calls are sensitive to wanting contiguous runs of address space when allocated. So, for example, on a 32-bit host OS, if I tried to mmap a 2GB file, my chances are slim that the mapping will succeed due to address-space fragmentation, and instead I should use something like a series of 128MB mappings to pull an entire file in.

When I think of that design, even using 1024MB mmap sizes, a DBMS hosting a few huge databases, each represented by say a 1TB file, leaves me with thousands of memory-mapped regions in memory. In my own testing on Windows 7, trying to create a few hundred mmaps across a multi-GB file, I didn't just run into exceptions; I actually got the JVM to segfault every time I tried to allocate too much, and in one case got the video on my Windows 7 machine to cut out and re-initialize with an OS error popup I had never seen before.

Regardless of the argument of "you'll never likely handle files that large" or "this is a contrived example", the fact that I could code something up like that with those kinds of side effects put my internal alarm on high alert and made me consider an alternative implementation (below).

BESIDES that issue, my understanding of memory-mapped files is that I have to re-create the mapping every time the file is grown, and in the case of this file, which is append-only by design, it is literally constantly growing.

I can combat this to some extent by growing the file in chunks (say 8MB at a time) and only re-creating the mapping every 8MB, but the need to constantly re-create these mappings has me nervous, especially with no explicit unmap feature supported in Java.
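
Roughly what I mean by growing in chunks (again only a sketch under my own assumptions; the 8MB chunk size, the class name, and the single-writer assumption are placeholders):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Rough sketch: re-create the tail mapping only when an append would run
// past it, growing the mapping in fixed chunks. Assumes a single writer
// and records far smaller than the chunk size.
class ChunkedAppender {
    private static final long CHUNK = 8L * 1024 * 1024; // grow/remap 8 MB at a time (placeholder)
    private final FileChannel channel;
    private MappedByteBuffer tail;   // mapping covering the current tail of the file
    private long mappedStart = 0;    // file offset where the current mapping begins
    private long mappedEnd = 0;      // file offset where the current mapping ends
    private long writePos = 0;       // next append position

    ChunkedAppender(FileChannel channel) {
        this.channel = channel;
    }

    void append(byte[] record) throws IOException {
        if (tail == null || writePos + record.length > mappedEnd) {
            // Round the new mapping end up to the next chunk boundary so the
            // mapping is only re-created once per CHUNK bytes of growth.
            mappedStart = (writePos / CHUNK) * CHUNK;
            mappedEnd = ((writePos + record.length) / CHUNK + 1) * CHUNK;
            tail = channel.map(FileChannel.MapMode.READ_WRITE, mappedStart, mappedEnd - mappedStart);
        }
        tail.position((int) (writePos - mappedStart));
        tail.put(record);
        writePos += record.length;
    }
}
```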

Question #1 of 2

Given all of my findings up to this point, my conclusion would be that memory-mapped files are a good solution primarily for read-heavy or read-only workloads, but not for write-heavy ones, given the need to re-create the mapping constantly.

But then I look at the landscape around me, with solutions like MongoDB embracing memory-mapped files all over the place, and I feel like I am missing some core component here (I do know it allocates in something like 2GB extents at a time, so I imagine they are working around the re-map cost with this logic AND helping to maintain sequential runs on-disk).

At this point I don't know whether it is Java's lack of an unmap operation that makes this so much more dangerous and unsuitable for my uses, or whether my understanding is incorrect and someone can point me North.

Alternative Design

An alternative design to the memory-mapped one proposed above that I will go with if my understanding of mmap is correct is as follows:

Define a direct ByteBuffer of a reasonable, configurable size (roughly 2, 4, 8, 16, 32, 64 or 128KB), making it easily compatible with any host platform (so I don't need to worry about the DBMS itself causing thrashing scenarios), and use the original FileChannel to perform specific-offset reads of the file one buffer-capacity chunk at a time, forgoing memory-mapped files entirely.
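
Something like the following is what I picture (a sketch only; the 64KB buffer size, the class name, and the one-instance-per-reader-thread convention are my own assumptions):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Rough sketch: positional reads into a reusable direct buffer, no memory
// mapping at all. Intended as one instance per reader thread, since the
// buffer is reused between calls.
class DirectReader {
    private final FileChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024); // 64 KB, configurable

    DirectReader(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    // Read up to one buffer's worth of bytes starting at the given offset.
    ByteBuffer readAt(long offset) throws IOException {
        buffer.clear();
        int bytesRead = channel.read(buffer, offset); // positional read; channel position unchanged
        if (bytesRead < 0) {
            throw new IOException("offset past end of file: " + offset);
        }
        buffer.flip();
        return buffer; // caller must check remaining() actually covers the whole record
    }
}
```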

The downside being that now my code has to worry about things like "did I read enough from the file to load the complete record?"

Another downside is that I don't get to make use of the OS's virtual memory logic, letting it keep more "hot" data in memory for me automatically; instead I just have to hope the file cache logic employed by the OS is big enough to do something helpful for me here.

Question #2 of 2

I was hoping to get a confirmation of my understanding of all of this.

For example, maybe the file cache is fantastic enough that in both cases (memory mapped or direct reads) the host OS will keep as much of my hot data available as possible, and the performance difference for large files is negligible.

Or maybe my understanding of the sensitive requirements for memory-mapped files (contiguous memory) is incorrect and I can ignore all that.

Riyad Kalla
  • If you have gained some insights since asking your question, please post them as an answer. A lot of people read this question and they could use the insight. There's a ton of "won't fix" bugs surrounding mmapping, like http://bugs.sun.com/view_bug.do?bug_id=6893654 (although JVM segfault and graphics driver crashing are even worse!) It's interesting how a simple, elegant native feature becomes complex and ugly in the managed world. – Aleksandr Dubinsky Dec 19 '13 at 21:56
  • @AleksandrDubinsky you are exactly right (about elegant becoming inelegant) -- my final finding is that mmap'ed files could not be created quickly without introducing significant instability into the system (I don't know if I clarified in this thread, but I managed to blue-screen my Windows dev machine). This detail ALONE made me want to stick to AsyncFileChannel use for file I/O and avoid mmap altogether, although Peter (below) has had significant success in Chronicle. – Riyad Kalla Dec 23 '13 at 17:52
  • @AleksandrDubinsky Once I was able to bring both the VM and my machine to its knees with apparent "mis-use" of mmapped files, I was done with going down that path. They are elegant and offer fantastic performance, but from more reading I did on AsyncFileChannel it seems you can get pretty close to the same performance (allowing the OS to utilize the FS and disk controller and I/O ordering to optimize requests). If you really want to go down the mmap path, Peter is the expert here. – Riyad Kalla Dec 23 '13 at 17:54

2 Answers

You might be interested in https://github.com/peter-lawrey/Java-Chronicle

In this I create multiple memory mappings to the same file (the size is a power of 2, up to 1 GB). The file can be any size (up to the size of your hard drive).

It also creates an index so you can find any record at random and each record can be any size.

It can be shared between processes and used for low latency events between processes.

I make the assumption you are using a 64-bit OS if you want to use large amounts of data. In this case a List of MappedByteBuffer will be all you ever need. It makes sense to use the right tools for the job. ;)
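
For illustration only (this is not Chronicle's actual code, just the idea of covering a large file with a list of mappings; the names and segment size are placeholders):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Cover an arbitrarily large file with fixed-size read-only mappings.
// The mappings remain valid after the channel is closed.
class MappedSegments {
    static List<MappedByteBuffer> mapWhole(Path file, long segmentSize) throws IOException {
        List<MappedByteBuffer> segments = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += segmentSize) {
                long len = Math.min(segmentSize, size - pos);
                segments.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
            }
        }
        return segments;
    }
}
```

On a 64-bit JVM you could call this with, say, 1 GB segments (`mapWhole(path, 1L << 30)`); on a 32-bit JVM it is the limited address space that bites you, not the file size.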

I have found it performs well even with data sizes around 10x your main memory size (I was using a fast SSD drive so YMMV).

Peter Lawrey
  • didn't realize you were the Chronicle author; thanks for the reply. How do you handle writing to the file, is it via the MBB's or do you just call the FileChannel directly and each time a read op comes in, outside the bounds of the furthest MBB, you just create a new one and add it to your dataBuffers list? A core detail I'm missing is what *lots* of large mapped files does to the host OS's memory usage. (cont in next comment...) – Riyad Kalla Feb 14 '12 at 01:03
  • since there seems to be a requirement of "contiguous ram" when mem-mapping a file, say I decide on something safe like 64 or 128MB, and as the DB file grows and requests come in for data beyond the existing mapped bounds, I just create more on the fly. Then let's say my data file gets to 100s of GBs and I have 100s if not 1000s of mem-mapped byte buffers... it seems like I am setting up my host computer to start paging like crazy as VM gets filled up. Being aware of gotcha cases and downsides is the crux of what I'm asking. – Riyad Kalla Feb 14 '12 at 01:07
  • Each memory-mapped file is somewhat expensive (I don't have exact details); I know if you create lots of 1 MB mappings you run out of resources pretty quickly. However, if you use 1 GB buffers you can create an 8 TB file. You can determine how much is too much for your system by creating lots of little ones (e.g. 4 KB). – Peter Lawrey Feb 14 '12 at 09:44
  • Making the buffers too large isn't such a problem. It only allocates to memory or disk the pages you actually use. This means you can make it 1 GB for data and the index, but do a `du` and find it's only using 8 KB. So the temptation is to make them as large as possible. The downside is that creating them is expensive (there is some work which is proportional to the size of the mapping). For this reason I make them a moderate size like 16 MB or 256 MB to reduce the hit incurred on a growth. – Peter Lawrey Feb 14 '12 at 09:47
  • I have looked at growing the mapping in a background thread; while much quicker, I found this leads to random BUS errors. :( It appears the mapping cannot be immediately used in a different thread to the one which created it. Even freeing it in a different thread can lead to a crash. – Peter Lawrey Feb 14 '12 at 09:49

I think you shouldn't worry about mmap'ping files up to 2GB in size.

Looking at the sources of MongoDB as an example of a DB making use of memory-mapped files, you'll find it always maps the full data file in MemoryMappedFile::mapWithOptions() (which calls MemoryMappedFile::map()). DB data spans multiple files, each up to 2GB in size. It also preallocates data files, so there's no need to remap as the data grows, and this prevents file fragmentation. Generally, you can draw inspiration from the source code of this DB.

pingw33n
  • @Thomas I've updated the links but I think that code is pretty much outdated; MongoDB has undergone a lot of changes since then. – pingw33n Jun 22 '15 at 09:59