Since you are simply streaming the data and never rereading it, the page cache does you no good whatsoever. In fact, given the amount of data you're pushing through the page cache, and the memory pressure from your application, otherwise useful data is likely evicted from the page cache and your system performance suffers because of that.
So don't use the cache when reading your data. Use direct IO. Per the Linux open() man page:
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.
...
NOTES
...
O_DIRECT
The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux alignment restrictions vary by filesystem and kernel version and might be absent entirely. However there is currently no filesystem-independent interface for an application to discover these restrictions for a given file or filesystem. Some filesystems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).
Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the filesystem. Since Linux 2.6.0, alignment to the logical block size of the underlying storage (typically 512 bytes) suffices. The logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:
blockdev --getss
...
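If you'd rather discover that logical block size programmatically, note that BLKSSZGET operates on the block device node, not on your data file. A minimal sketch (/dev/sda here is a placeholder for whatever device your filesystem lives on):

#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

// BLKSSZGET reports the logical sector size of the block device.
int devFd = ::open( "/dev/sda", O_RDONLY );
int sectorSize = 0;
::ioctl( devFd, BLKSSZGET, &sectorSize );
::close( devFd );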
Since you are not reading the data over and over, direct IO is likely to improve performance somewhat, as the data will go directly from disk into your application's memory instead of from disk, to the page cache, and then into your application's memory.
Use low-level, C-style I/O with open()/read()/close(), and open the file with the O_DIRECT flag:
int fd = ::open( filename, O_RDONLY | O_DIRECT );
This will result in the data being read directly into the application's memory, without being cached in the system's page cache.
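Be aware that open() fails with EINVAL on filesystems that don't support O_DIRECT at all (tmpfs, for example), so it's worth handling that case. A rough sketch of one way to fall back:

#include <errno.h>
#include <fcntl.h>

int fd = ::open( filename, O_RDONLY | O_DIRECT );
if ( fd < 0 && errno == EINVAL )
{
    // This filesystem doesn't support direct IO; fall back to a normal open.
    fd = ::open( filename, O_RDONLY );
}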
You'll have to read() using aligned memory, so you'll need something like this to actually read the data:
char *buffer;
size_t pageSize = sysconf( _SC_PAGESIZE );
size_t bufferSize = 32UL * pageSize;
// Note: posix_memalign() returns the error number directly instead of setting errno.
int rc = ::posix_memalign( ( void ** ) &buffer, pageSize, bufferSize );
posix_memalign() is a POSIX-standard function that returns a pointer to memory aligned as requested. Page-aligned buffers are usually more than sufficient, but aligning to the hugepage size (2 MiB on x86-64) makes the allocation eligible for transparent hugepages, making access to your buffer more efficient when you read it later.
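A minimal sketch of that, assuming an x86-64 Linux system with transparent hugepages available (the madvise() call is the explicit hint; verify the hugepage size on your system, e.g. via the Hugepagesize field in /proc/meminfo):

#include <stdlib.h>
#include <sys/mman.h>

const size_t hugePageSize = 2UL * 1024 * 1024;   // 2 MiB, typical on x86-64
size_t bufferSize = 16UL * hugePageSize;

char *buffer;
if ( ::posix_memalign( ( void ** ) &buffer, hugePageSize, bufferSize ) == 0 )
{
    // Explicitly ask the kernel to use transparent hugepages for this range.
    ::madvise( buffer, bufferSize, MADV_HUGEPAGE );
}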
// read() may return less than bufferSize: 0 means EOF, -1 an error (check errno).
ssize_t bytesRead = ::read( fd, buffer, bufferSize );
Without your code, I can't say how to get the data from buffer into your std::vector, but it shouldn't be hard. There are likely ways to wrap the C-style low-level file descriptor with a C++ stream of some type, and to configure that stream to use memory properly aligned for direct IO.
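As a rough sketch, though (assuming the file is a flat array of raw double values, which is purely an assumption on my part):

#include <unistd.h>
#include <vector>

std::vector< double > values;

for ( ;; )
{
    ssize_t bytesRead = ::read( fd, buffer, bufferSize );
    if ( bytesRead <= 0 )
    {
        break;   // 0 is EOF, -1 is an error; check errno in real code
    }

    // Direct IO reads come back in multiples of the block size (except at
    // EOF), so each chunk here holds a whole number of doubles.
    const double *chunk = reinterpret_cast< const double * >( buffer );
    values.insert( values.end(), chunk, chunk + bytesRead / sizeof( double ) );
}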
If you want to see the difference, try this:
echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/your/big/data/file of=/dev/null bs=32k
Time that. Then look at the amount of data in the page cache (the Cached line in /proc/meminfo, or the buff/cache column of free).
Then do this:
echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/your/big/data/file iflag=direct of=/dev/null bs=32k
Check the amount of data in the page cache after that: direct IO should have left it essentially untouched.
You can experiment with different block sizes to see what works best on your hardware and filesystem, but with direct IO, keep the block size a multiple of the logical block size of the underlying storage.
Note well, though, that direct IO is very implementation-dependent. Requirements to perform direct IO can vary significantly between different filesystems, and performance can vary drastically depending on your IO pattern and your specific hardware. Most of the time it's not worth those dependencies, but the one simple use where it usually is worthwhile is streaming a huge file without rereading/rewriting any part of the data.