Huge plain-text data file
I read a huge file in chunks using Python, then apply a regex to each chunk to extract the value that follows an identifier tag. Because a chunk can end in the middle of a record, data is missing at the chunk boundaries.
Requirements:
- The file must be read in chunks.
- The chunk sizes must be smaller than or equal to 1 GiB.
Python code example
import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')
with open('huge_file', 'r') as f:
    data_chunk = f.read(1024 * 1024 * 1024)  # at most 1 GiB per read
    matches = identifier_pattern.findall(data_chunk)
Chunk data examples
Good: number of tags equals number of values
Identifier: value
Identifier: value
Identifier: value
Identifier: value
Because the chunk boundary can land anywhere, you get varying boundary issues like the one below. The third identifier returns an incomplete value, "v" instead of "value", and the next chunk starts with the leftover "alue". After parsing, that value is missing.
Bad: identifier value incomplete
Identifier: value
Identifier: value
Identifier: v
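The split can be reproduced in memory. Below, io.StringIO stands in for the huge file, and the chunk size 49 is a hypothetical value chosen so the cut lands inside the third record (with the pattern above, which requires a trailing newline, the truncated record matches nothing at all rather than returning "v"):

```python
import io
import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')

# Stand-in for the huge file; 49 cuts inside the third record.
f = io.StringIO("Identifier: value\n" * 3)

first_chunk = f.read(49)   # ends with "Identifier: v"
second_chunk = f.read(49)  # starts with "alue\n"

# The truncated record matches nothing, so its value is silently lost.
print(identifier_pattern.findall(first_chunk))   # ['value', 'value']
print(identifier_pattern.findall(second_chunk))  # []
```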
How do you solve chunk boundary issues like this?
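For reference, a minimal sketch of one possible approach, not a verified solution: carry the unmatched tail of each chunk (everything after its last newline) into the next read, so a split record is reassembled before matching. The function name and the io.StringIO test data are illustrative assumptions:

```python
import io
import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')

def parse_in_chunks(f, chunk_size):
    """Yield identifier values, carrying split records across chunks."""
    tail = ''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk  # may slightly exceed chunk_size by the tail
        # Everything after the last newline may be an incomplete record;
        # hold it back and prepend it to the next chunk.
        cut = data.rfind('\n') + 1
        tail = data[cut:]
        yield from identifier_pattern.findall(data[:cut])

f = io.StringIO("Identifier: value\n" * 3)
print(list(parse_in_chunks(f, 49)))  # ['value', 'value', 'value']
```

The chunk size 49 deliberately splits the third record; all three values are still recovered because the fragment "Identifier: v" is reunited with "alue\n" before the regex runs.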