I want to read a large text file block by block, with the blocks defined by offsets. To do that, I first scan the file from the start (file.seek(0)) and use file.tell() to record the start position of each block. The size of each block is then size = i1 - i0, where i0 and i1 are the start positions of the current and next block, so a block should be obtainable with file.read(size) for further processing. The problem is that a block read this way is not consistent with what is actually in the file: I get extra text beyond the block I want. Here is a simple example in which each block is a single line.
>>> # construct a simple text file
>>> texts = [f"This is line No. {i}\n" for i in range(20)]
>>> with open("test.txt", "w") as f:
...     f.writelines(texts)
>>> with open("test.txt", "r") as f:
...     print(f.read())
This is line No. 0
This is line No. 1
This is line No. 2
This is line No. 3
This is line No. 4
This is line No. 5
This is line No. 6
This is line No. 7
This is line No. 8
This is line No. 9
This is line No. 10
This is line No. 11
This is line No. 12
This is line No. 13
This is line No. 14
This is line No. 15
This is line No. 16
This is line No. 17
This is line No. 18
This is line No. 19
>>> # Get offset for each line
>>> offsets = []
>>> with open("test.txt", "r") as f:
...     b = f.seek(0)
...     for line in iter(f.readline, ""):
...         offsets.append(b)
...         # start position of next block
...         b = f.tell()
>>> offsets
[0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 221, 242, 263, 284, 305, 326, 347, 368, 389]
>>> # Read blocks by offsets
>>> with open("test.txt", "r") as f:
...     for i0, i1 in zip(offsets[:-1], offsets[1:]):
...         _ = f.seek(i0, 0)
...         text = f.read(i1 - i0)
...         print(f"## Start = {i0}, end = {i1}, offset = {i1-i0}")
...         print("Block text: ")
...         print(text)
## Start = 0, end = 20, offset = 20
Block text:
This is line No. 0
T
## Start = 20, end = 40, offset = 20
Block text:
This is line No. 1
T
## Start = 40, end = 60, offset = 20
Block text:
This is line No. 2
T
## Start = 60, end = 80, offset = 20
Block text:
This is line No. 3
T
## Start = 80, end = 100, offset = 20
Block text:
This is line No. 4
T
## Start = 100, end = 120, offset = 20
Block text:
This is line No. 5
T
## Start = 120, end = 140, offset = 20
Block text:
This is line No. 6
T
## Start = 140, end = 160, offset = 20
Block text:
This is line No. 7
T
## Start = 160, end = 180, offset = 20
Block text:
This is line No. 8
T
## Start = 180, end = 200, offset = 20
Block text:
This is line No. 9
T
## Start = 200, end = 221, offset = 21
Block text:
This is line No. 10
T
## Start = 221, end = 242, offset = 21
Block text:
This is line No. 11
T
## Start = 242, end = 263, offset = 21
Block text:
This is line No. 12
T
## Start = 263, end = 284, offset = 21
Block text:
This is line No. 13
T
## Start = 284, end = 305, offset = 21
Block text:
This is line No. 14
T
## Start = 305, end = 326, offset = 21
Block text:
This is line No. 15
T
## Start = 326, end = 347, offset = 21
Block text:
This is line No. 16
T
## Start = 347, end = 368, offset = 21
Block text:
This is line No. 17
T
## Start = 368, end = 389, offset = 21
Block text:
This is line No. 18
T
It seems that f.tell() reports the right start position of each block, but when I read blocks using sizes computed as the differences between consecutive offsets, I get extra text. Why, and how can I fix this?
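One way to check whether newline translation is involved is to compare the file's size in bytes with the number of characters that text mode returns. The sketch below rests on an assumption about the cause that is not confirmed above: it writes the file with newline="\r\n" to mimic what text mode does by default on Windows. The name path and the temporary directory are illustrative only.

```python
import os
import tempfile

# Hypothetical reproduction: force "\r\n" line endings so the behaviour is
# the same on every platform (on Windows, text mode does this by default).
texts = [f"This is line No. {i}\n" for i in range(20)]
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", newline="\r\n") as f:
    f.writelines(texts)

# Size on disk, in bytes: every "\n" was stored as the two bytes "\r\n".
n_bytes = os.path.getsize(path)

# Length as seen by text mode: "\r\n" is translated back to a single "\n".
with open(path, "r") as f:
    n_chars = len(f.read())

print(n_bytes, n_chars)  # 410 390
```

If the two numbers differ, then the offsets from tell() (which follow the bytes on disk in this example) and the size passed to read(size) in text mode (a character count) are measured in different units, which would account for one extra character per block.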
===========Update===========
When I open the file with rb to read it in binary mode, I get the same results, so the read mode does not appear to be the cause.
>>> offsets = []
>>> with open("test.txt", "rb") as f:
...     b = f.seek(0)
...     for line in iter(f.readline, b""):
...         offsets.append(b)
...         # start position of next block
...         b = f.tell()
>>> offsets
[0, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 221, 242, 263, 284, 305, 326, 347, 368, 389]
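Note that the snippet above only collects offsets in binary mode; it does not read the blocks back in binary mode. A minimal sketch of doing both steps on the same binary handle follows, so that tell(), seek() and read(size) all count bytes. It assumes the file has "\r\n" line endings (forced here with newline="\r\n") and ASCII content; path and blocks are illustrative names.

```python
import os
import tempfile

# Hypothetical test file with "\r\n" line endings (an assumption, see above).
texts = [f"This is line No. {i}\n" for i in range(20)]
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", newline="\r\n") as f:
    f.writelines(texts)

blocks = []
with open(path, "rb") as f:
    # Record the byte offset at which each line starts, plus the end of file.
    offsets = [0]
    for _ in iter(f.readline, b""):
        offsets.append(f.tell())

    # Read each block back; in binary mode read(size) counts bytes as well.
    for i0, i1 in zip(offsets[:-1], offsets[1:]):
        f.seek(i0)
        blocks.append(f.read(i1 - i0).decode("ascii"))

print(offsets[:3])      # [0, 20, 40]
print(repr(blocks[0]))  # 'This is line No. 0\r\n'
```

Appending the end-of-file position as a final offset also lets the last block be read, which the zip over offsets[:-1] and offsets[1:] would otherwise skip.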