
I want to speed up thousands of forward seeks and reads within a 1G contiguous section of a huge file (~10G), starting at a known offset (say 8G). The access pattern looks like: seek(8.00G), read(128 bytes), seek(8.01G), read(56 bytes), and so on.

Basic Pseudocode - Approach 1

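#include <fstream>

// 8G does not fit in a 32-bit int, so use std::streamoff.
std::streamoff known_offset = 8LL * 1024 * 1024 * 1024;
char buf[128];
std::ifstream fs("fname", std::ios_base::in | std::ios_base::binary);
for (int i = 0; i < 1000; i++) {
    fs.seekg(known_offset, std::ios::beg);
    fs.read(buf, 128);
    // The next offset depends on the bytes just read.
    known_offset += calculate_fwd_offset_based_on_buf(buf);
}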

Possible Improvement - Approach 2
I thought about increasing the ifstream read buffer size so that it prefetches a large chunk, letting the following seeks be served from the buffer and avoiding or reducing disk seeks. Something along the lines of:

// Stream buffer, separate from the 128-byte read buffer `buf` in approach 1.
char io_buf[8*1024*1024];
std::ifstream fs("fname", std::ios_base::in | std::ios_base::binary);
fs.rdbuf()->pubsetbuf(io_buf, sizeof(io_buf));
// Remaining code remains same
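
One detail that may matter here: for a filebuf, the effect of pubsetbuf with a non-null buffer is implementation-defined, and some standard libraries (libstdc++ is the one I am fairly sure about) only honor it when it is called before the file is opened. A minimal sketch of that ordering follows, with the buffer on the heap rather than as an 8M stack array. The helper name `open_with_buffer` is made up for illustration, and this only changes how the buffer is installed; it does not by itself change how seekg treats the buffered data.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: opens `path` for binary reading with a caller-owned
// stream buffer. `io_buf` must outlive `fs`; it lives on the heap so that a
// multi-megabyte buffer does not sit on the stack.
bool open_with_buffer(std::ifstream& fs, const std::string& path,
                      std::vector<char>& io_buf) {
    assert(!fs.is_open());  // the buffer is installed before open()
    fs.rdbuf()->pubsetbuf(io_buf.data(),
                          static_cast<std::streamsize>(io_buf.size()));
    fs.open(path, std::ios_base::in | std::ios_base::binary);
    return fs.is_open();
}
```

Usage would be something like `std::vector<char> io_buf(8*1024*1024); std::ifstream fs; open_with_buffer(fs, "fname", io_buf);` followed by the same seekg/read loop as in approach 1.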

I ran strace and time on approaches (1) and (2). Approach (2) takes longer, and both make exactly the same number of lseek/read calls; the reads in approach (2) are just larger, which is my guess as to why it is slower. Is there any way to make seekg/read use the buffered data as expected?

Inconvenient Solution
I can get the above approach to work by removing seekg(next_offset) and calling fs.read(buf, num_chars_to_next_offset) instead. This reduces the number of reads from 9000 to ~40. I would really prefer to make it work with seeks, though, since this way I am needlessly copying characters into a throwaway buffer.
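
A variation on this workaround, sketched below, is to skip forward with istream::ignore instead of reading into a throwaway buffer. ignore consumes characters from the stream buffer without copying them into a user array, so the large buffer still gets used and no scratch space is needed. `ForwardReader` and its members are hypothetical names for illustration. It tracks the position itself so that seekg/tellg are never called inside the loop, and it only supports moving forward. Whether this is actually faster than the read-and-discard version is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical forward-only reader. `pos` counts bytes consumed since the
// starting point, so no seekg()/tellg() is needed while iterating.
struct ForwardReader {
    std::istream& in;
    std::streamoff pos;

    // Advance to `target` by discarding characters; forward only.
    // Returns false if the stream ended before reaching `target`.
    bool skip_to(std::streamoff target) {
        assert(target >= pos);
        in.ignore(target - pos);  // default delimiter is EOF: skip exactly n chars
        pos += in.gcount();
        return pos == target;
    }

    // Read up to n bytes into `out`; returns the number of bytes read.
    std::streamsize read(char* out, std::streamsize n) {
        in.read(out, n);
        pos += in.gcount();
        return in.gcount();
    }
};
```

With a file, the idea would be to do a single seekg(known_offset) up front, construct `ForwardReader r{fs, 0}`, and then alternate `r.skip_to(next_relative_offset)` and `r.read(buf, 128)`, where offsets are relative to known_offset.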

tangy
  • Did you look into [file mapping](https://learn.microsoft.com/en-us/windows/desktop/memory/file-mapping)? – AlexG Feb 28 '19 at 00:59
  • I can't use mmap for this specific use case. I have to use iostreams because some later processing is done with `boost::iostreams` filters, among other reasons. – tangy Feb 28 '19 at 01:01

0 Answers