7

Huge plain-text data file

I read a huge file in chunks using Python and then apply a regex to each chunk. Based on an identifier tag, I want to extract the corresponding value. Because a match can be split across a chunk boundary, data goes missing at the boundaries.

Requirements:

  • The file must be read in chunks.
  • The chunk sizes must be smaller than or equal to 1 GiB.


Python code example

import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')
with open('huge_file', 'r') as f:
    data_chunk = f.read(1024 * 1024 * 1024)  # read (at most) a 1 GiB chunk; in practice this happens in a loop
    m = identifier_pattern.findall(data_chunk)


Chunk data examples

Good: the number of tags matches the number of values

Identifier: value
Identifier: value
Identifier: value
Identifier: value


Because of where the chunk ends, you get boundary issues like the one below: the third identifier returns an incomplete value, "v" instead of "value", and the next chunk starts with "alue". This causes missing data after parsing.

Bad: identifier value incomplete

Identifier: value
Identifier: value
Identifier: v


How do you solve chunk boundary issues like this?

JodyK
  • 105
  • 7

5 Answers

3

Assuming this is your exact problem, you could probably just adapt your regex and read the file line by line (which won't load the whole file into memory):

import re
matches = []
identifier_pattern = re.compile(r'Identifier: (.*?)$')
with open('huge_file') as f:
    for line in f:
        matches += re.findall(identifier_pattern, line)

print("matches", matches)
Jack
  • 20,735
  • 11
  • 48
  • 48
  • Good low-memory-footprint solution. The file is not line-based, as the presented example might suggest; I hadn't specified the requirement unambiguously and had to state explicitly that the file must be read in chunks. Somehow I have to find a solution at the chunk boundary while avoiding accidental double counts. – JodyK May 27 '17 at 09:49
2

You can control how each chunk is formed and keep its size close to 1024 * 1024 * 1024; that way you avoid missing parts:

import re


identifier_pattern = re.compile(r'Identifier: (.*?)\n')
counter = 1024 * 1024 * 1024
data_chunk = ''
with open('huge_file', 'r') as f:
    for line in f:
        data_chunk = '{}{}'.format(data_chunk, line)
        if len(data_chunk) > counter:
            m = re.findall(identifier_pattern, data_chunk)
            print(m)
            data_chunk = ''
    # Analyse last chunk of data
    m = re.findall(identifier_pattern, data_chunk)
    print(m)

Alternatively, you can go over the same file twice with different starting read positions (the first time from offset 0, the second time from an offset equal to the maximum length of a matched string collected during the first pass) and store the results as dictionaries keyed by the start position of the matched string in the file. That position is the same in both passes, so merging the results is not a problem; however, I think it would be more accurate to merge by both the start position and the length of the matched string.
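
A minimal sketch of that two-pass idea (reading in binary mode; the scan helper and the shift computation are my own assumptions, not part of the answer):

import re

# Hypothetical sketch of the two-pass approach described above.
identifier_pattern = re.compile(rb'Identifier: (.*?)\n')
CHUNK_SIZE = 1024 * 1024 * 1024

def scan(path, start_offset):
    """Scan the file in chunks from start_offset; key matches by absolute byte offset."""
    found = {}
    with open(path, 'rb') as f:
        f.seek(start_offset)
        position = start_offset
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            for m in identifier_pattern.finditer(chunk):
                found[position + m.start()] = m.group(1)
            position += len(chunk)
    return found

first_pass = scan('huge_file', 0)
# Shift the second pass by the longest matched string seen so far, so a match
# split by a boundary in the first pass lies inside one chunk in the second.
shift = max((len(b'Identifier: ') + len(v) + 1 for v in first_pass.values()), default=0)
# Merging by start offset removes double counts at the boundaries.
merged = {**first_pass, **scan('huge_file', shift)}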

Good luck!

Andriy Ivaneyko
  • 20,639
  • 6
  • 60
  • 82
  • This is a very clever approach, closest to what I want; I hadn't thought about it like this. Line-based reading will, however, create a new challenge for multi-processing the chunks. That's why I would prefer the f.read() method and feeding the chunks to separate processes: line-by-line synchronization would require very costly interprocess operations. – JodyK May 27 '17 at 10:45
  • @JodyK thanks for your comment, you are right; I've updated the answer with an alternative approach – Andriy Ivaneyko May 27 '17 at 18:54
1

If the file is line-based, the file object is a lazy generator of lines: it loads the file into memory line by line (in chunks). Based on that, you can use:

import re
matches = []
for line in open('huge_file'):
    matches += re.findall(r"Identifier:\s(.*?)$", line)
Pedro Lobito
  • 94,083
  • 31
  • 258
  • 268
  • 1
    This is indeed a great solution for line-based files. Is there also a solution where the file is not line based and where you 'must' read chunks? – JodyK May 27 '17 at 09:37
0

I have a solution very similar to Jack's answer:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

m = []
with open('huge_file', 'r') as f:
    for line in f:
        m.extend(identifier_pattern.findall(line))

You could use another part of the regex API to achieve the same result:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

m = []
with open('huge_file', 'r') as f:
    for line in f:
        pattern_found = identifier_pattern.search(line)
        if pattern_found:
            value_found = pattern_found.group(1)
            m.append(value_found)

We could simplify this using a generator expression and a list comprehension:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

with open('huge_file', 'r') as f:
    patterns_found = (identifier_pattern.search(line) for line in f)
    m = [pattern_found.group(1)
         for pattern_found in patterns_found if pattern_found]
EvensF
  • 1,479
  • 1
  • 10
  • 17
  • I agree that these are good solutions for line-based files. Assuming that we have a strict condition where we 'have to' read the file in chunks: is there a possible solution to get around the chunk boundary issue? – JodyK May 27 '17 at 09:58
  • These examples were based on your example. But for each iteration, could you keep the last few characters from the previous chunk where the pattern could appear? – EvensF May 27 '17 at 13:52
  • I hadn't been clear about the chunk requirement. Your proposal comes close to Andriy's approach; I guess that is the closest way to solve this. I am afraid it is not possible to do a kind of look-ahead into the succeeding chunk or a look-behind into the preceding chunk. Line-by-line approaches take away the multi-processing benefits one would have with large chunks. – JodyK May 27 '17 at 15:29
0

If the length of the matched result string is known, I think the easiest way is to cache the bytes of the previous chunk around the boundary.

Suppose the result's length is 3: keep the last 2 characters of the previous chunk and prepend them to the new chunk before matching.

Pseudo-code:

regex  pattern
string boundary
int    match_result_len

for chunk in chunks:
    match(boundary + chunk, pattern)
    boundary = chunk[-(match_result_len - 1):]
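
A minimal runnable sketch of this carry-over idea, assuming (as the pseudo-code does) that every complete match, e.g. "Identifier: value\n", has the same known length:

import re

# Sketch of the carry-over idea above; the fixed match length of 18 characters
# ("Identifier: value\n") is an assumption for illustration.
identifier_pattern = re.compile(r'Identifier: (.*?)\n')
CHUNK_SIZE = 1024 * 1024 * 1024
MATCH_RESULT_LEN = 18

values = []
boundary = ''
with open('huge_file', 'r') as f:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        # Prepend the cached tail so a match split by the boundary is completed.
        values.extend(identifier_pattern.findall(boundary + chunk))
        # Keep MATCH_RESULT_LEN - 1 characters: too short to contain a whole
        # match, so nothing is counted twice in the next iteration.
        boundary = chunk[-(MATCH_RESULT_LEN - 1):]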
dotslashlu
  • 3,361
  • 4
  • 29
  • 56