I'm dealing with a large file (>500 GB, few columns but many lines), and I need to extract some lines from it. I have a list of start and end bytes (measured from the beginning of the file) for the parts that I need, something like:

A,0,500
B,1089,4899

Here's the thing: I have to do this around 1.2 M times. Which is better for performance: always starting from the beginning of the document, or counting from current position? So it would be something like:

with open(large_file, 'rb') as f:  # binary mode, since the offsets are byte offsets
    for start_byte, end_byte in byte_list:
        f.seek(start_byte) # always start from beginning of file
        chunk_str = f.read(end_byte-start_byte)

or

with open(large_file, 'rb') as f:  # binary mode; Python 3 text files reject nonzero relative seeks
    current_pos = 0
    for start_byte, end_byte in byte_list:
        f.seek(start_byte - current_pos, 1) # seek from current position
        chunk_str = f.read(end_byte-start_byte)
        current_pos = end_byte

Or does it even matter at all? I've read "How does Python's seek function work?", but I'm not technically proficient enough to understand how this would affect reading very large text files.

irene
  • Possible duplicate of [How does Python's seek function work?](https://stackoverflow.com/questions/13278748/how-does-pythons-seek-function-work) – mkrieger1 May 17 '18 at 16:16
  • It doesn't matter at all, the stdlib converts it in a simple operation. – tdelaney May 17 '18 at 16:17
  • @mkrieger1 I read that too, but it's more technical than I could understand. Anyway thanks for all those who answered so far. I'd go with the simpler version then (just start from the beginning). – irene May 17 '18 at 16:21

1 Answer

Just use the absolute form, since an absolute byte offset is what you have. The work of actually reading from the correct location after using seek is buried in the file system driver used by your OS. seek itself does little more than set a variable.
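As a minimal sketch of the absolute form (not a tuned implementation): the helper name read_ranges and the in-memory io.BytesIO stand-in are hypothetical, chosen only for illustration. It assumes each range is (start, end) with end exclusive, as in the question's code, and that the file object is in binary mode so that offsets really are bytes.

```python
import io


def read_ranges(f, byte_ranges):
    """Return the bytes for each (start, end) range; end is exclusive."""
    chunks = []
    for start, end in byte_ranges:
        f.seek(start)  # whence defaults to 0: offset from the start of the file
        chunks.append(f.read(end - start))
    return chunks


# Stand-in for the large file; a real one would be opened with open(path, "rb").
demo = io.BytesIO(b"0123456789")
chunks = read_ranges(demo, [(0, 3), (5, 9)])
```

With a real file, the same function could be called as read_ranges(f, byte_list) inside a with open(large_file, "rb") as f: block.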

You would use f.seek(d, 1) if you don't already know your current position, but know that you need to skip ahead by d bytes.
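A small sketch of that relative case, assuming a binary-mode file object (in Python 3, text-mode files only accept a zero offset with whence 1). The helper name skip_and_read and the sample data are hypothetical.

```python
import io


def skip_and_read(f, skip, size):
    """Skip `skip` bytes ahead of wherever f currently is, then read `size` bytes."""
    f.seek(skip, io.SEEK_CUR)  # io.SEEK_CUR == 1: offset relative to the current position
    return f.read(size)


buf = io.BytesIO(b"abcdefghij")
first = buf.read(2)                # consumes b"ab"
second = skip_and_read(buf, 3, 2)  # skips b"cde", then reads the next two bytes
position = buf.tell()              # absolute position after both reads
```

The caller never has to track the position itself; the file object does that, and tell() reports it if needed.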

chepner