
In my C++ program running on Linux (Ubuntu 14.04), I need to read a 90 GB file completely into a buffer held in a C++ vector, and I have only 125 GB of memory.

When I read the file chunk by chunk, the cached memory usage in Linux keeps growing until it exceeds 50% of the 128 GB of RAM, and the free memory easily drops below 50 GB.

              total        used        free      shared  buff/cache   available
Mem:            125          60           0           0          65          65
Swap:           255           0         255

I found that the free memory eventually becomes zero, the file-reading process almost stops, and I have to manually run:

echo 3 | sudo tee /proc/sys/vm/drop_caches

to clear the cached memory so that the file-reading process resumes. I understand that cached memory is meant to speed up reading the same file again. My question is: how can I avoid manually running the drop-caches command and ensure that the file-reading process completes successfully?
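
For reference, the reading loop looks roughly like the sketch below (base_t, the file name handling, and the exact chunk size are simplified placeholders for my real code):

#include <fstream>
#include <vector>

using base_t = char;   // placeholder element type

// Simplified sketch of the current approach: read the file in large chunks
// through buffer_g and append each chunk to the destination vector.
std::vector<base_t> readChunked( const char *filename, size_t totalElements )
{
    const size_t buffer_size = 500UL * 1000 * 1000;   // ~500M elements per chunk
    std::ifstream fin( filename, std::ios::binary );
    std::vector<base_t> buffer_g( buffer_size );
    std::vector<base_t> data;
    data.reserve( totalElements );                    // presized to the full file

    while ( fin )
    {
        fin.read( reinterpret_cast<char *>( &buffer_g[0] ),
                  buffer_size * sizeof( base_t ) );
        data.insert( data.end(), buffer_g.begin(),
                     buffer_g.begin() + fin.gcount() / sizeof( base_t ) );
    }
    return data;
}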

  • have you seen this: https://serverfault.com/questions/288319/linux-not-freeing-large-disk-cache-when-memory-demand-goes-up You're searching for a workaround to a problem that shouldn't exist. The disk cache shouldn't be growing that large when there is other memory pressure. Also, presizing the vector may discourage the disk cache from growing so large. – xaxxon Jul 07 '17 at 09:04
  • Just memory map this file instead; see the sketch after these comments. – user7860670 Jul 07 '17 at 09:05
  • @VTT that may not provide the desired performance. – xaxxon Jul 07 '17 at 09:06
  • @xaxxon I expect this approach to provide the same or somewhat better performance. – user7860670 Jul 07 '17 at 09:08
  • @vtt Can it guarantee that an arbitrary read at any point in time after mmaping is completed is served directly from memory? – xaxxon Jul 07 '17 at 09:09
  • @xaxxon If the `MAP_POPULATE` flag is used for the `mmap` call, then all the mapped data will be prefetched into memory. – user7860670 Jul 07 '17 at 09:11
  • @VTT very cool, I learned something :) I recommend using mmap as well, now that I know this, as it is likely to be highly optimized for whatever platform you are running on - better tweaked than whatever C++ code would be written in user-space. I am still quite interested in what is going on with the disk cache growing like that, though. it may be worth your while to post this to poweruser as well (after checking through the things in the link I posted above to make sure it's not one of those things) – xaxxon Jul 07 '17 at 09:15
  • Do you `reserve()` storage for vector beforehand? – geza Jul 07 '17 at 09:37
  • Also this thread might help you: https://stackoverflow.com/questions/6818606/how-to-programmatically-clear-the-filesystem-memory-cache-in-c-on-a-linux-syst?rq=1 – KjMag Jul 07 '17 at 10:53
  • Thanks for all the comments. I did reserve() the buffer vector, which has a size of around 500M, so I need to do this hundreds of times: fin.read( reinterpret_cast<char*>(&buffer_g[0]), buffer_size*sizeof(base_t) ); buffer_g is the vector. – Ren Chen Jul 07 '17 at 17:00
  • @xaxxon A page of a vector might have been swapped out, just as easily. You would use `posix_madvise()`, `posix_fadvise()` or equivalent to tell the OS which pages to swap in and what replacement policy to use. – Davislor Jul 07 '17 at 21:15
  • @geza that's what I suggested the very first comment.. – xaxxon Jul 08 '17 at 02:18
  • @xaxxon: Presizing the vector helps the application use less memory; it has nothing to do with the disk cache. And in this case it could matter a lot, if one looks at the numbers. With `reserve()`, everything should be fine. Without `reserve()`, at some point the application could allocate more than 125 GB of memory, so it begins to use swap. I agree with the first part of your comment: Linux should drop file caches immediately if an application needs more memory. It is surprising that it doesn't do it (if the OP is right). – geza Jul 08 '17 at 09:46
  • @VTT any thoughts on the answer and comments below? – xaxxon Jul 08 '17 at 20:07
  • @geza yeah, I know. That's why I told him to do it in the first comment on the post. Just curious why you felt the need to say it again much later. – xaxxon Jul 09 '17 at 06:22
  • @RenChen: What's happening when the reading process slows down? Is it paging out your process's memory to disk? (check with `vmstat 5` or `dstat`). Maybe try adjusting `/proc/sys/vm/swappiness` to 20 or 10 instead of the default 60, so the kernel is less eager to swap dirty pages from your process to make room for pagecache. – Peter Cordes Jul 09 '17 at 09:01
  • I see you have 255GB of swap, which is insanely huge. Some swap space (like 1 or 2GB) is good even with lots of RAM, but unless your system actually uses that much swap sometimes, it's way overkill and a waste of disk space. It even wastes a tiny bit of RAM keeping track of all those available swap pages. – Peter Cordes Jul 09 '17 at 09:02
  • Do you modify the vector after reading it from disk? If not, `mmap` is probably good (like @VTT is suggesting), so the file can stay cached in memory instead of having to be re-read when your process restarts. (One downside: prevents using hugepages). If you only modify a small fraction of the total pages in the file, a private mapping could work well. – Peter Cordes Jul 09 '17 at 09:06
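
For reference, the mmap + MAP_POPULATE approach suggested in the comments above could look roughly like this sketch (the function name, read-only access, and the omission of error handling are my simplifications):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Sketch of the approach from the comments: map the whole file read-only and
// let MAP_POPULATE prefault it, so later accesses are served from memory.
const char *mapWholeFile( const char *filename, size_t &length )
{
    int fd = ::open( filename, O_RDONLY );
    struct stat st;
    ::fstat( fd, &st );
    length = st.st_size;

    void *addr = ::mmap( nullptr, length, PROT_READ,
                         MAP_PRIVATE | MAP_POPULATE, fd, 0 );
    ::close( fd );                               // the mapping outlives the fd
    return static_cast<const char *>( addr );    // MAP_FAILED check omitted
}

Whether this beats a plain read loop for a single streaming pass is exactly what the answer and its comments below debate.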

1 Answer


Since you are simply streaming the data and never rereading it, the page cache does you no good whatsoever. In fact, given the amount of data you're pushing through the page cache, and the memory pressure from your application, otherwise useful data is likely evicted from the page cache and your system performance suffers because of that.

So don't use the cache when reading your data. Use direct IO. Per the Linux open() man page:

O_DIRECT (since Linux 2.4.10)

Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.

...

NOTES

...

O_DIRECT

The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux alignment restrictions vary by filesystem and kernel version and might be absent entirely. However there is currently no filesystem-independent interface for an application to discover these restrictions for a given file or filesystem. Some filesystems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).

Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the filesystem. Since Linux 2.6.0, alignment to the logical block size of the underlying storage (typically 512 bytes) suffices. The logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:

      blockdev --getss

...

Since you are not reading the data over and over, direct IO is likely to improve performance somewhat, as the data will go directly from disk into your application's memory instead of from disk, to the page cache, and then into your application's memory.

Use low-level, C-style I/O with open()/read()/close(), and open the file with the O_DIRECT flag (on Linux, O_DIRECT is only exposed by <fcntl.h> when _GNU_SOURCE is defined; g++ defines it by default):

int fd = ::open( filename, O_RDONLY | O_DIRECT );

This will result in the data being read directly into the application's memory, without being cached in the system's page cache.

You'll have to read() using aligned memory, so you'll need something like this to actually read the data:

char *buffer;                                 // needs <stdlib.h> for posix_memalign()
size_t pageSize = sysconf( _SC_PAGESIZE );    // needs <unistd.h>
size_t bufferSize = 32UL * pageSize;          // a multiple of the logical block size

int rc = ::posix_memalign( ( void ** ) &buffer, pageSize, bufferSize );

posix_memalign() is a POSIX-standard function that returns a pointer to memory aligned as requested. Page-aligned buffers are usually more than sufficient, but aligning to the hugepage size (2 MiB on x86-64) hints to the kernel that you want transparent hugepages for that allocation, making access to your buffer more efficient when you read it later (see the variant sketched below).

ssize_t bytesRead = ::read( fd, buffer, bufferSize );
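
If you want the transparent-hugepage hint mentioned above, a variant of the allocation could look like this (a sketch: the 2 MiB value is the x86-64 hugepage size, and madvise() is only a hint the kernel may ignore):

#include <stdlib.h>
#include <sys/mman.h>

// Sketch: align the IO buffer to the 2 MiB hugepage size and explicitly
// request transparent hugepages for it.
char *allocHugeAlignedBuffer( size_t bufferSize )
{
    const size_t hugePageSize = 2UL * 1024 * 1024;
    void *buffer = nullptr;
    if ( ::posix_memalign( &buffer, hugePageSize, bufferSize ) != 0 )
        return nullptr;
    ::madvise( buffer, bufferSize, MADV_HUGEPAGE );   // hint only
    return static_cast<char *>( buffer );
}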

Without your code, I can't say exactly how to get the data from buffer into your std::vector, but it shouldn't be hard. There are likely ways to wrap the low-level C-style file descriptor in a C++ stream of some type and to configure that stream to use memory properly aligned for direct IO.
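
For illustration only, one possible shape for the whole loop, appending each directly-read chunk into a std::vector<char> (the function name, the char element type, and the 32-page buffer size are my own choices; error handling is minimal):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdexcept>
#include <vector>

// Sketch: stream a file into a std::vector<char> with O_DIRECT, so the data
// bypasses the page cache on its way into the application's memory.
std::vector<char> readFileDirect( const char *filename )
{
    int fd = ::open( filename, O_RDONLY | O_DIRECT );
    if ( fd < 0 )
        throw std::runtime_error( "open() failed" );

    struct stat st;
    ::fstat( fd, &st );
    std::vector<char> data;
    data.reserve( st.st_size );            // reserve the full file size once

    size_t pageSize = sysconf( _SC_PAGESIZE );
    size_t bufferSize = 32UL * pageSize;   // aligned length for direct IO
    char *buffer;
    if ( ::posix_memalign( ( void ** ) &buffer, pageSize, bufferSize ) != 0 )
    {
        ::close( fd );
        throw std::runtime_error( "posix_memalign() failed" );
    }

    ssize_t bytesRead;
    while ( ( bytesRead = ::read( fd, buffer, bufferSize ) ) > 0 )
        data.insert( data.end(), buffer, buffer + bytesRead );

    ::free( buffer );
    ::close( fd );
    return data;
}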

If you want to see the difference, try this:

echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/your/big/data/file of=/dev/null bs=32k

Time that. Then look at the amount of data in the page cache (the buff/cache column in free).

Then do this:

echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/your/big/data/file iflag=direct of=/dev/null bs=32k

Check the amount of data in the page cache after that...

You can experiment with different block sizes to see what works best on your hardware and filesystem.

Note well, though, that direct IO is very implementation-dependent. Requirements to perform direct IO can vary significantly between different filesystems, and performance can vary drastically depending on your IO pattern and your specific hardware. Most of the time it's not worth those dependencies, but the one simple use where it usually is worthwhile is streaming a huge file without rereading/rewriting any part of the data.
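
If the alignment restrictions are a problem on your filesystem, a gentler alternative brought up in the comments is to keep ordinary buffered reads and advise the kernel to drop the pages you have already consumed. A rough sketch (dropConsumedRange is a hypothetical helper; POSIX_FADV_DONTNEED is a hint, not a guarantee):

#include <fcntl.h>

// Sketch: after processing each chunk read through the page cache, tell the
// kernel the byte range will not be needed again, so it can be evicted
// instead of accumulating as "cached" memory.
void dropConsumedRange( int fd, off_t offset, off_t length )
{
    ::posix_fadvise( fd, offset, length, POSIX_FADV_DONTNEED );
}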

  • why do you like this solution more than mmap? – xaxxon Jul 08 '17 at 02:18
  • @xaxxon It's *faster* than `mmap()`, and `mmap()` won't do anything to reduce memory pressure. `mmap()` is way too often treated as some kind of magic bullet when it's nothing of the sort. When streaming a huge file *once*, it's one of the *worst* ways to read data. [Read what one Linus Torvalds has to say](http://marc.info/?l=linux-kernel&m=95496636207616&w=2) [my bolding]: "People love mmap() and other ways to play with the page tables to optimize away a copy operation, and **sometimes** it is worth it. HOWEVER, playing games with the virtual memory mapping is very expensive in itself." – Andrew Henle Jul 08 '17 at 12:23
  • presumably he wants to use the data a lot, that's why he's reading it into memory. The idea is that if you mmap it, that lets the OS decide the best way to get it and keep it in memory for you instead of trying to fuss with options to read and a vector. – xaxxon Jul 09 '17 at 01:10
  • @xaxxon You're assuming the data as stored in the file can be used directly. And `mmap()` is *slow*. The virtual memory mappings that must get made for `mmap()`'ing a file are *slow*. To quote Torvalds again: "Downsides to mmap: - quite noticeable setup and teardown costs. And I mean _noticeable_. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's The TLB flush needed after unmapping stuff. - page faulting is expensive. That's how the mapping gets populated, and it's quite slow." – Andrew Henle Jul 09 '17 at 01:42
  • To continue, this is what prompted Torvalds response: "I was very disheartened to find that on my system the mmap/mlock approach took *3 TIMES* as long as the read solution. It seemed to me that mmap/mlock should be at least as fast as read. Comments are invited." – Andrew Henle Jul 09 '17 at 01:43
  • " - if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread. " – xaxxon Jul 09 '17 at 06:20
  • @AndrewHenle: In some limited testing on a Skylake desktop running Linux on bare metal (no virtualization overhead for page tables), I've found that `mmap` is often a win for moderate-size files (like 100MB) if the data is expected to be hot in the pagecache. I was testing the time to get a byte-swapped copy of the data into user-space memory (since my real use case has to read big-endian data). read() in small chunks and then looping back over while the data is hot in L1 or L2 is pretty good (with hand-vectorized AVX2 bswap), but slightly slower than copy+swap from an mmaped region. – Peter Cordes Jul 09 '17 at 08:40
  • The same destination buffer was being rewritten every time, though, so neither case was testing the overhead of touching fresh memory for the destination. It seems that the copy-to-user in `read(2)` also has to do some TLB invalidation or something, since I was seeing a lot of TLB misses in the read version, especially when I explored the phenomenon by making the cache-blocking size smaller than 4k (so multiple read() calls are done on the same page). I think my best guess was that `read(2)` invalidates the TLB entries for the destination buffer's pages. – Peter Cordes Jul 09 '17 at 08:44
  • Align to 2MB to allow using anonymous hugepages. Probably also use `madvise(MADV_HUGEPAGE)`, especially if you don't set `/sys/kernel/mm/transparent_hugepage/defrag` to `always`. (x86-64 hugepages are 2MB, x86-32 hugepages are 4MB). I don't think anonymous hugepages will ever use 1GiB hugepages. (Linux does use 1GiB pages for kernel-space mapping of all the RAM, but I don't think you can get that in user-space.) – Peter Cordes Jul 09 '17 at 09:11
  • @PeterCordes *if the data is expected to be hot in the pagecache* Absolutely - accessing the data many times is **THE** optimal use-case for `mmap()`. I've never said otherwise. The OP in the question, though, is reading the file once. *I think my best guess was that read(2) invalidates the TLB entries for the destination buffer's pages.* That's interesting. Since you are reading into the same buffer every time that seems to be somewhat suboptimal, I'd think. I seem to remember a question recently, though, where `mmap()` using huge pages didn't work so well on Linux. – Andrew Henle Jul 09 '17 at 10:40
  • @AndrewHenle: the reading-into-same-buffer thing was as part of a synthetic benchmark I created to loosely model the real use-case of reading a file from the filesystem once, before doing some FP math on it and exiting. I only need one pass over the file contents for the lifetime of the process, but with mmap it can be a copy+endian-swap to get some useful work done while copying. In my benchmark, I was unmapping, closing, and re-opening the file every pass over the file, otherwise mmap would be a huge win. – Peter Cordes Jul 09 '17 at 10:43
  • Anyway, for the OP's use-case, the file will fit in RAM, so if its mostly not modified, mmap may be good to allow it to stay hot in the pagecache. Except a file-backed mapping can't use hugepages, AFAIK. I commented on the question to ask for clarification. – Peter Cordes Jul 09 '17 at 10:47
  • @PeterCordes *for the OP's use-case, the file will fit in RAM* It does use a significant fraction of the RAM as stated, though, so if there are other significant memory demands, that may not be completely true for the OP. Given the OP's concern over memory usage, I provided this answer as a way to minimize memory usage. *Except a file-backed mapping can't use hugepages, AFAIK.* I believe that is correct. I haven't been able to find the question I was referring to, however. – Andrew Henle Jul 09 '17 at 11:00
  • Yeah, if you do have to read the file from the FS, direct I/O might be good. It sounds like the OP wants any other memory pressure to evict the file from the pagecache instead of paging out his processes. If the reading process does any work mixed in with the reads, though, going through the pagecache with normal `read(2)` might let it benefit from more readahead? Maybe using `posix_fadvise(POSIX_FADV_SEQUENTIAL)` or something could help hint the pagecache to drop the pages after read() gets them? Not something I've looked into myself, so IDK how well that works. – Peter Cordes Jul 09 '17 at 11:07
  • @AndrewHenle you don't suppose the point of reading the file once was to then use the data read multiple times out of memory later? – xaxxon Jul 09 '17 at 23:12