
I want to read a large text file block by block, where the blocks are defined by offsets. I first scan the file from the start (file.seek(0)) and use file.tell to record the start position of each block. The size of each block is then size = i1 - i0, where i0 and i1 are the start positions of the current and the next block, and the block should be obtainable with file.read(size) for further processing. The problem is that a block read this way is not consistent with what is actually in the file: I get extra text beyond the block I want. Here is a simple example in which each block is a single line.

>>> # construct a simple text file
>>> texts = [f"This is line No. {i}\n" for i in range(20)]
>>> with open("test.txt", "w") as f:
        f.writelines(texts)

>>> with open("test.txt", "r") as f:
        print(f.read())

This is line No. 0
This is line No. 1
This is line No. 2
This is line No. 3
This is line No. 4
This is line No. 5
This is line No. 6
This is line No. 7
This is line No. 8
This is line No. 9
This is line No. 10
This is line No. 11
This is line No. 12
This is line No. 13
This is line No. 14
This is line No. 15
This is line No. 16
This is line No. 17
This is line No. 18
This is line No. 19

>>> # Get offset for each line
>>> offsets = []
>>> with open("test.txt", "r") as f:
        b = f.seek(0)
        for line in iter(f.readline, ""):
            offsets.append(b)
            # start position of next block
            b = f.tell()

>>> offsets
[0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 221, 242, 263, 284, 305, 326, 347, 368, 389]
>>> # Read blocks by offsets (the file must be reopened; the earlier handle is closed)
>>> with open("test.txt", "r") as f:
        for i0, i1 in zip(offsets[:-1], offsets[1:]):
            _ = f.seek(i0, 0)
            text = f.read(i1 - i0)
            print(f"## Start = {i0}, end = {i1}, offset = {i1-i0}")
            print("Block text: ")
            print(text)

## Start = 0, end = 20, offset = 20
Block text: 
This is line No. 0
T
## Start = 20, end = 40, offset = 20
Block text: 
This is line No. 1
T
## Start = 40, end = 60, offset = 20
Block text: 
This is line No. 2
T
## Start = 60, end = 80, offset = 20
Block text: 
This is line No. 3
T
## Start = 80, end = 100, offset = 20
Block text: 
This is line No. 4
T
## Start = 100, end = 120, offset = 20
Block text: 
This is line No. 5
T
## Start = 120, end = 140, offset = 20
Block text: 
This is line No. 6
T
## Start = 140, end = 160, offset = 20
Block text: 
This is line No. 7
T
## Start = 160, end = 180, offset = 20
Block text: 
This is line No. 8
T
## Start = 180, end = 200, offset = 20
Block text: 
This is line No. 9
T
## Start = 200, end = 221, offset = 21
Block text: 
This is line No. 10
T
## Start = 221, end = 242, offset = 21
Block text: 
This is line No. 11
T
## Start = 242, end = 263, offset = 21
Block text: 
This is line No. 12
T
## Start = 263, end = 284, offset = 21
Block text: 
This is line No. 13
T
## Start = 284, end = 305, offset = 21
Block text: 
This is line No. 14
T
## Start = 305, end = 326, offset = 21
Block text: 
This is line No. 15
T
## Start = 326, end = 347, offset = 21
Block text: 
This is line No. 16
T
## Start = 347, end = 368, offset = 21
Block text: 
This is line No. 17
T
## Start = 368, end = 389, offset = 21
Block text: 
This is line No. 18
T

It seems f.tell() reports the right start position of each block, but when I read blocks using sizes computed as differences between offsets, I get more text than expected. Why does this happen, and how can I fix it?

===========Update===========

When I used rb to scan the file in binary mode, I obtained the same offsets, so the offsets themselves do not seem to depend on the read mode.

>>> offsets = []
>>> with open("test.txt", "rb") as f:
        b = f.seek(0)
        for line in iter(f.readline, b""):
            offsets.append(b)
            # start position of next block
            b = f.tell()
>>> offsets
[0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 221, 242, 263, 284, 305, 326, 347, 368, 389]
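
A minimal sketch of how byte offsets like these might be used, assuming the blocks are also read back in binary mode and decoded afterwards. The names `path` and `blocks`, the temporary directory, and the forced `\r\n` line endings are assumptions made so the sketch reproduces the offsets above on any platform; they are not part of the original post.

```python
import os
import tempfile

# Build the same 20-line file. newline="\r\n" forces Windows-style line
# endings (an assumption for reproducibility on every platform).
texts = [f"This is line No. {i}\n" for i in range(20)]
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", newline="\r\n", encoding="utf-8") as f:
    f.writelines(texts)

# Collect byte offsets: in binary mode, tell() is a plain byte count.
offsets = []
with open(path, "rb") as f:
    pos = f.tell()
    for line in iter(f.readline, b""):
        offsets.append(pos)
        pos = f.tell()
    offsets.append(pos)  # end of file, so the last block also has an end

# Read each block back in binary mode, where read(n) counts bytes,
# then decode and normalize the line endings by hand.
blocks = []
with open(path, "rb") as f:
    for i0, i1 in zip(offsets[:-1], offsets[1:]):
        f.seek(i0)
        raw = f.read(i1 - i0)
        blocks.append(raw.decode("utf-8").replace("\r\n", "\n"))

print(blocks[0], end="")
print(blocks[-1], end="")
```

Appending the end-of-file position to `offsets` is a design choice of this sketch: it lets the last line be read as a block too, which the `zip(offsets[:-1], offsets[1:])` loop would otherwise skip.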
Elkan
  • On a file opened in text mode, the values returned by `.tell()` have *no valid usage* other than to be passed to `.seek()` later - performing any sort of math on the values is meaningless. This was originally to support historical systems that don't actually have byte-oriented files at all: additional bits had to be tacked on to the file position to specify a byte offset within the words actually stored in the file. But it still applies to one modern system - on Windows, newlines are 2 bytes on disk vs. 1 character in memory, causing a discrepancy in positions. – jasonharper Apr 30 '20 at 05:18
  • @jasonharper I think that's worth writing up as an answer. – Karl Knechtel Apr 30 '20 at 05:25
  • @jasonharper Thanks. I realized this once I searched online. But following others' answers, I opened the file in `rb` mode instead of `r` mode to read it in binary, and found I can't easily get the offsets if I still want to go through the file line by line. I would appreciate it if you could post an answer that addresses all of this, as Karl suggested. – Elkan Apr 30 '20 at 05:37
  • Agree with @jasonharper. When a file is opened in r mode, it is treated as text, where a newline is a single character. If you want it to be treated as 2 characters, you must read the file in binary mode. – ajay gandhi Apr 30 '20 at 05:44
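
The discrepancy described in the comments can be reproduced with a small sketch. It assumes a file written with `\r\n` line endings (forced here with `newline="\r\n"` so it behaves the same outside Windows); the file name `demo.txt` and the variable names are hypothetical. The size 20 is the difference between the first two offsets in the question.

```python
import os
import tempfile

# Two lines with Windows-style endings: each "\n" is stored as "\r\n".
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", newline="\r\n", encoding="utf-8") as f:
    f.write("This is line No. 0\nThis is line No. 1\n")

size = 20  # bytes occupied on disk by the first line, including "\r\n"

# Text mode: read(size) counts CHARACTERS after newline translation.
# The first line is only 19 characters in memory, so one extra
# character from the next line is returned.
with open(path, "r", encoding="utf-8") as f:
    text_block = f.read(size)

# Binary mode: read(size) counts BYTES, matching the on-disk layout.
with open(path, "rb") as f:
    byte_block = f.read(size)

print(repr(text_block))  # 'This is line No. 0\nT'
print(repr(byte_block))  # b'This is line No. 0\r\n'
```

This matches the stray `T` in the question's output: the offsets measure bytes on disk, while a text-mode `read` measures translated characters.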
