I'm dealing with a large file (>500 GB, only a few columns but a huge number of lines), and I need to extract some parts of it. I have a list of start and end bytes (measured from the beginning of the file) for the parts that I need, something like:
A,0,500
B,1089,4899
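In case it helps, this is roughly how I turn that list into the (start, end) pairs used in the loops below. The file name regions.csv and the label column are just placeholders for illustration:

import csv

# Build byte_list from the index sketched above; only the numeric
# start/end offsets are used later, the label is ignored.
byte_list = []
with open('regions.csv', newline='') as idx:
    for label, start, end in csv.reader(idx):
        byte_list.append((int(start), int(end)))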
Here's the thing: I have to do this around 1.2 million times. Which is better for performance: always seeking from the beginning of the file, or seeking relative to the current position? Either way it would look something like:
with open(large_file, 'rb') as f:  # binary mode, so the offsets are byte counts
    for start_byte, end_byte in byte_list:
        f.seek(start_byte)  # always seek from the beginning of the file
        chunk_str = f.read(end_byte - start_byte)
or
with open(large_file, 'rb') as f:  # binary mode again
    current_pos = 0
    for start_byte, end_byte in byte_list:
        f.seek(start_byte - current_pos, 1)  # seek relative to the current position
        chunk_str = f.read(end_byte - start_byte)
        current_pos = end_byte
Or does it even matter at all? I've read How does Python's seek function work? but I'm not technically proficient enough to understand how this would affect reading very large text files.
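If it matters, this is the kind of rough timing harness I would use to compare the two approaches on my data. It's only a sketch, and it assumes byte_list is sorted by start offset (as in the example above), since the relative variant seeks forward from the end of the previous chunk:

import time

def read_absolute(path, byte_list):
    # Variant 1: absolute seek from the start of the file every time.
    with open(path, 'rb') as f:
        for start_byte, end_byte in byte_list:
            f.seek(start_byte)
            f.read(end_byte - start_byte)

def read_relative(path, byte_list):
    # Variant 2: seek relative to the current position.
    with open(path, 'rb') as f:
        current_pos = 0
        for start_byte, end_byte in byte_list:
            f.seek(start_byte - current_pos, 1)
            f.read(end_byte - start_byte)
            current_pos = end_byte

for func in (read_absolute, read_relative):
    t0 = time.perf_counter()
    func(large_file, byte_list)
    print(func.__name__, time.perf_counter() - t0)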