
I am writing a program that receives huge amounts of data (in pieces of different sizes) from the network, processes it, and writes it to memory. Since some pieces can be very large, my current approach is to limit the buffer size. If a piece is larger than the maximum buffer size, I write the data to a temporary file and later read the file back in chunks for processing and permanent storage.

I'm wondering if this can be improved. I've been reading about mmap for a while but I'm not one hundred percent sure if it can help me. My idea is to use mmap for reading the temporary file. Does this help in any way? The main thing I'm concerned about is that an occasional large piece of data should not fill up my main memory causing everything else to be swapped out.

Also, do you think the approach with temporary files is useful? Should I even be doing that or, perhaps, should I trust the linux memory manager to do the job for me? Or should I do something else altogether?
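To illustrate, here is a rough sketch of my current approach (simplified; `MAX_BUF`, `CHUNK_SZ`, `read_exact` and `process` are placeholders, not my real code):

```c
/* Sketch of the current approach: pieces up to MAX_BUF stay in RAM;
 * larger ones are spilled to a temp file and read back in chunks.
 * read_exact() and process() are hypothetical helpers; error handling
 * is trimmed for brevity. */
#include <stdio.h>
#include <stdlib.h>

#define MAX_BUF  (64 * 1024 * 1024)   /* largest piece kept in RAM */
#define CHUNK_SZ (4 * 1024 * 1024)    /* read-back granularity */

void read_exact(int fd, char *buf, size_t len);   /* loops over read(2) */
void process(const char *buf, size_t len);        /* process + store */

void handle_piece(int net_fd, size_t piece_len)
{
    char *buf = malloc(piece_len <= MAX_BUF ? piece_len : CHUNK_SZ);

    if (piece_len <= MAX_BUF) {               /* small piece: RAM only */
        read_exact(net_fd, buf, piece_len);
        process(buf, piece_len);
    } else {                                  /* large piece: spill to disk */
        FILE *tmp = tmpfile();                /* unlinked, auto-deleted */
        for (size_t left = piece_len; left > 0; ) {
            size_t n = left < CHUNK_SZ ? left : CHUNK_SZ;
            read_exact(net_fd, buf, n);
            fwrite(buf, 1, n, tmp);
            left -= n;
        }
        rewind(tmp);                          /* re-read in chunks */
        for (size_t left = piece_len; left > 0; ) {
            size_t n = left < CHUNK_SZ ? left : CHUNK_SZ;
            fread(buf, 1, n, tmp);
            process(buf, n);
            left -= n;
        }
        fclose(tmp);
    }
    free(buf);
}
```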

Elektito
  • How big is 'big'? Most importantly, how does it compare to the total real RAM on the computer where this will run? – zwol Apr 24 '12 at 20:35
  • Big is several gigabytes. I have 24G of RAM so some files can occupy as much as a quarter of the physical RAM or even more. – Elektito Apr 24 '12 at 20:37
  • Basically, by using `mmap()`, you are causing that memory to be backed by a file instead of by swap (so-called anonymous memory). Under memory pressure, the kernel may decide to reclaim file-backed memory more aggressively than anonymous memory, or it may do the reverse, I don't know. – ninjalj Apr 24 '12 at 23:12
  • "Under memory pressure, the kernel may decide to reclaim file-backed memory more aggresively than anonymous memory, or it may do the reverse, I don't know." So which one is it? Does the kernel reclaim file-backed memory or swap aggressively under memory pressure? – Gokul Aug 30 '18 at 18:09

3 Answers


mmap can help you in some ways; I'll explain with some hypothetical examples:

First: let's say you're running out of memory, and your application, which has a 100 MB chunk of malloc'ed memory, gets 50% of it swapped out. That means the OS has to write 50 MB to the swapfile, and if you later need the data back, you have written, occupied, and then read back 50 MB of your swapfile.

If that memory was instead mmap'ed, the operating system will not write it to the swapfile (it knows the data is identical to the file itself); it will simply discard those 50 MB of pages (again, supposing you have not written anything for now), and that's that. If you ever need that memory read again, the OS fetches the contents not from the swapfile but from the original file you mmap'ed, so if any other program needs 50 MB of swap, it's available. There is also no swapfile-manipulation overhead at all.

Second: let's say you read a 100 MB chunk of data, and according to the initial 1 MB of header data, the information you want is located at offset 75 MB, so you don't need anything between 1 MB and 74.9 MB. You read it for nothing but to keep your code simpler. With mmap, you only read the data you actually access (rounded up to the OS page size, which is usually 4 KB), so it would only read the first and the 75th megabyte. I think it's very hard to find a simpler or more effective way to avoid disk reads than mmap'ing files. And if for some reason you need the data at offset 37 MB, you can just use it; you don't have to mmap again, as the whole file is accessible in memory (limited, of course, by your process's address space).
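A hedged sketch of that scenario (the file name, the 75 MB offset, and the assumption that the file is bigger than 75 MB are all made up for illustration):

```c
/* Map the whole file read-only, then touch only the first page and one
 * far-away offset; only those pages are actually read from disk. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);       /* hypothetical file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) return 1;

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    close(fd);                     /* the mapping stays valid after close */

    char h = p[0];                 /* faults in the first page only */
    char x = p[75ul << 20];        /* faults in the single page at 75 MB */
    printf("%d %d\n", h, x);       /* nothing in between was read */

    munmap(p, st.st_size);
    return 0;
}
```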

All mmap'ed files are backed by themselves, not by the swapfile; the swapfile exists to hold data that has no file backing it, which usually means malloc'ed data, or data that is backed by a file but has been altered and [cannot/shall not] be written back to it before the program tells the OS to do so via an msync call.
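To make that concrete, a hedged sketch of a writable shared mapping (with `MAP_SHARED` the kernel does eventually flush dirty pages back to the file on its own; `msync` forces the flush at a point you choose; the file name is made up):

```c
/* Write through a MAP_SHARED mapping; the dirtied page lives in the
 * page cache and is written back to the file, not to swap. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, 4096) < 0) return 1;  /* ensure one page */

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    memcpy(p, "hello", 5);         /* dirties the page in the page cache */
    msync(p, 4096, MS_SYNC);       /* force write-back to the file now */

    munmap(p, 4096);
    close(fd);
    return 0;
}
```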

Beware that you don't need to map the whole file into memory; you can map any amount (the 2nd argument is `size_t length`) starting from any page-aligned place (the 6th argument, `off_t offset`). Unless your file is likely to be enormous, you can safely map 1 GB of data with no fear, even if the system only packs 64 MB of physical memory. That's for reading; if you plan on writing, you should be more conservative and map only the stuff that you need.
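For example, a sketch of mapping only a window of a large file (the file name and the sizes are illustrative; the file must be large enough, and the offset passed to mmap must be a multiple of the page size):

```c
/* Map a 64 MB window starting 1 GB into the file, instead of mapping
 * the whole thing. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("huge.bin", O_RDONLY);   /* hypothetical file */
    if (fd < 0) return 1;

    size_t len = 64ul << 20;   /* 2nd mmap argument: length of the window */
    off_t  off = 1l  << 30;    /* 6th mmap argument: page-aligned offset  */

    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
    if (p == MAP_FAILED) return 1;

    /* p[0 .. len-1] now corresponds to file bytes off .. off+len-1 */

    munmap(p, len);
    close(fd);
    return 0;
}
```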

Mapping files will help you make your code simpler (you already have the file contents in memory, ready to use, with much less memory overhead since it's not anonymous memory) and faster (you only read the data your program actually accesses).

Carlos Lint
  • Thank you. It's good to know all this, though unfortunately most of this does not apply to my current situation. – Elektito Apr 25 '12 at 22:23

The main advantage of mmap with big files is sharing the same memory mapping between two or more processes: if they mmap the same file with MAP_SHARED, it will be loaded into memory only once for all the processes that use the data, saving memory.
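A hedged sketch of that sharing, using fork() for brevity (unrelated processes that map the same file with `MAP_SHARED` share its page-cache pages the same way; the file name is made up):

```c
/* Parent and child see one MAP_SHARED mapping of the same file, so its
 * pages exist only once in physical memory. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, 4096) < 0) return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    if (fork() == 0) {            /* child writes through the mapping */
        memcpy(p, "shared", 6);
        _exit(0);
    }
    wait(NULL);
    printf("%.6s\n", p);          /* parent reads "shared" back */
    return 0;
}
```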

But AFAIK, mmap maps the entire file into memory (there are examples around of mmap failing with files bigger than physical memory + swap space), so if you access the file from a single process, it will not help you with the physical memory consumption.

Community
  • So is there another way I can make sure not all of a file is loaded into memory? You see, I have another problem, too. I need to send the data for storage in MongoDB. Now Mongo needs me to give it a pointer to some in-memory buffer, so it seems that whether I load the file myself or use mmap, the file will be stored in memory in its entirety for a period of time. – Elektito Apr 24 '12 at 19:42
  • 2
    I'm not familiar with MongoDB, but if it wants an in-memory buffer containing the entire file, then it seems to me there's no point in using temporary files at all. If the behavior when you read straight from the network into memory buffers and then pass those to MongoDB is unacceptable, I think you're going to have to break your large files into chunks *within the database*. – zwol Apr 24 '12 at 21:45
  • 4
    mmap indeed does "map the entire file into memory", but it does not *read it from disk into memory* to do so. Mapping files bigger than physical mem + swap space might fail only if you use specified flags or under very specific kernel configurations (which are not commonly used) or if you try to mmap files with total size bigger than your *virtual* memory. Virtual memory exhaustion is the real threat on 32-bit systems, but anything else should not cause mmap to fail when you do it right way. – user1643723 Nov 29 '17 at 02:33

I believe mmap doesn't require all data to be in memory at the same moment - it uses the page cache to keep recently used pages in memory, and the rest on disk.
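One hedged illustration of working with that: madvise() lets you tell the kernel how you'll use the mapping, so it can read ahead for a sequential pass and then drop the pages instead of letting one big file crowd out everything else (the file name and the helper are made up):

```c
/* Stream once through a mapped file, hinting the kernel to read ahead
 * (MADV_SEQUENTIAL) and then to drop the pages (MADV_DONTNEED). */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void process_page(const char *page, size_t len);   /* hypothetical */

int main(void)
{
    int fd = open("temp.bin", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) return 1;

    size_t len = st.st_size;
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;

    madvise(p, len, MADV_SEQUENTIAL);      /* expect one forward pass */

    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    for (size_t i = 0; i < len; i += pg)
        process_page(p + i, len - i < pg ? len - i : pg);

    madvise(p, len, MADV_DONTNEED);        /* let the kernel drop the pages */
    munmap(p, len);
    close(fd);
    return 0;
}
```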

If you are reading one chunk at a time, using a temporary file probably won't help you; but if you are reading multiple chunks concurrently, using multiple threads or processes, or using select/poll, then it might.

user1277476