I want to speed up thousands of seeks and reads that move monotonically forward through a contiguous 1G section of a huge file (~10G), starting at a known offset (say 8G). I.e. the pattern looks like: seek(8.00G), read(128 bytes), seek(8.01G), read(56 bytes), and so on.
Basic Pseudocode - Approach 1
std::int64_t known_offset = 8LL * 1024 * 1024 * 1024;  // 8G needs a 64-bit type
char buf[128];
std::ifstream fs("fname", std::ios_base::in | std::ios_base::binary);
for (int i = 0; i < 1000; i++) {
    fs.seekg(known_offset, std::ios::beg);  // fs is an object, so '.', not '->'
    fs.read(buf, 128);
    known_offset += calculate_fwd_offset_based_on_buf(buf);
}
Possible Improvement - Approach 2
I thought about increasing the ifstream read buffer size so that it prefetches and fills the buffer, letting the following seeks be served from the buffer and avoiding/reducing disk seeks. Something along the lines of:
char buf[8*1024*1024];
std::ifstream fs("fname", std::ios_base::in | std::ios_base::binary);
fs.rdbuf()->pubsetbuf(buf, 8*1024*1024);
// Remaining code remains same
I then ran strace/time on both approaches. Approach (2) takes longer, and exactly the same number of lseek/read syscalls happen in both, with larger reads happening in case (2), which I would guess is why it is slower. Is there any way to make seekg/read work as expected?
Inconvenient Solution
I can get the above approach to work by removing the seekg(next_offset)
and doing a fs.read(buf, num_chars_to_next_offset)
instead. This reduces the number of reads from 9000 to ~40. I would really prefer that it work with seeks, though, since this way I am unnecessarily copying characters into a no-op buffer.