How to iterate through very large text file separated by semicolons?

Question

If I want to iterate through a text file line-by-line, here is how I do it:

for curr_line in open('my_file.txt', 'r').readlines():
    print '|' + curr_line + '|'

If I want to iterate through a text file based on semi-colon separators, here is how I do it:

for curr_line in open('my_file.txt', 'r').read().split(';'):
    print '|' + curr_line + '|'

If I want to iterate through a very large text file line-by-line, here is how I do it:

for curr_line in open('my_file.txt', 'r').xreadlines():
    print '|' + curr_line + '|'

But how can I iterate through a very large text file based on semi-colon separators? It is 7+ gigabytes so I cannot read the whole thing into memory.

Below is the sample input file my_file.txt:

AAAA;BBBBB
BB;CCC;
DDDDD
D
D;
EEEE;F

Here is the output I want to see based on the snippets above:

|AAAA|
|BBBBB
BB|
|CCC|
|DDDDD
D
D|
|EEEE|
|F|

What's the criteria to insert the `|`s? To bracket series of the same item? Is it important to maintain the `\n` like the B's and D's? Can they not be in the same line? — r.ook, Oct 12 '18 at 02:32

dawg · Answer 1 · 2018-10-12T03:50:30.040

The method .readlines() reads the entire file into a list. This may not be practicable with a 7GB file.

Given the example added, you can use mmap with a regex to match across the whole file without reading it all into memory at once:

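import mmap
import re

fn = 'my_file.txt'  # path to the large input file

with open(fn, 'rb') as f_in:
    # Map the file read-only; pages are loaded on demand, not all at once.
    mm = mmap.mmap(f_in.fileno(), 0, access=mmap.ACCESS_READ)
    # A mapped file holds bytes, so the pattern must be bytes as well.
    for m in re.finditer(b'([^;]*)', mm):
        txt = m.group(1)
        if txt:
            print('|{}|'.format(txt.decode()))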

With the example input, this prints:

|AAAA|
|BBBBB
BB|
|CCC|
|
DDDDD
D
D|
|
EEEE|
|F|
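
The output above keeps the newline that follows each semicolon, which the expected output in the question does not show. If those surrounding newlines are unwanted, one option is to strip each match before using it. The following is a minimal sketch under that assumption, not part of the original answer; the function name `iter_fields` is hypothetical, and it assumes a non-empty file with a single-byte `;` separator and text in the default UTF-8 encoding.

```python
import mmap
import re

# Hypothetical helper: yields each semicolon-separated field with
# leading/trailing whitespace (including newlines) removed.
FIELD = re.compile(b'[^;]+')


def iter_fields(path):
    with open(path, 'rb') as f_in:
        # Read-only mapping; mmap raises ValueError for an empty file.
        with mmap.mmap(f_in.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in FIELD.finditer(mm):
                field = m.group().strip()
                if field:  # skip fields that were only whitespace
                    yield field.decode()
```

Calling `for field in iter_fields('my_file.txt'): print('|{}|'.format(field))` would then print one bracketed entry per field. Newlines inside a field, such as the one between `BBBBB` and `BB`, are left untouched.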

edited Oct 12 '18 at 03:50

answered Oct 12 '18 at 01:38

No, this is not right. A semi-colon separated entry might span across multiple lines. Your solution assumes that semi-colons are separators within each line. – Saqib Ali Oct 12 '18 at 01:47
What character defines a 'line'? Maybe add an example of the input and output. – dawg Oct 12 '18 at 01:49
Dawg, I just did so. Sorry my original question wasn't clear enough. – Saqib Ali Oct 12 '18 at 01:57
Solution added based on the example. – dawg Oct 12 '18 at 03:51

jedwards · Answer 2 · 2018-10-12T02:50:14.777

Here's a "reader" object that will read blocks (with a size of your choosing) from your file, and emit semicolon-separated items as they're found:

class MyReader:
    def __init__(self, handle, delim, read_size=512):
        self.handle = handle
        self.delim = delim
        self.read_size = read_size


    def __iter__(self):
        buffer = []
        while True:
            block = self.handle.read(self.read_size)
            if not block: break     # Reached EOF

            while block:
                (before, sep, block) = block.partition(self.delim)
                buffer.append(before)

                if sep:             # Separator was found, yield the buffer
                    yield ''.join(buffer)
                    buffer = []

        # We broke free, flush the buffer and return (explicit)
        yield ''.join(buffer)
        return

Which you might use, for example:

with open('file.txt') as f:
    reader = MyReader(f, ';')

    for chunk in reader:
        print(repr(chunk))

Output:

'AAAA'
'BBBBB\nBB'
'CCC'
'\nDDDDD\nD\nD'
'\nEEEE'
'F'
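
The reader above looks for the delimiter inside each block separately, which works for a single character like `;`. A delimiter longer than one character could straddle two blocks and be missed. As a hedged variation, not the original answer's code, the sketch below carries the unfinished tail over to the next block so that case is handled too. The name `split_stream` and its default `read_size` are assumptions chosen for illustration.

```python
def split_stream(handle, delim=';', read_size=64 * 1024):
    """Lazily yield delim-separated pieces from an open text file."""
    pending = ''  # text read so far that is not yet terminated by delim
    while True:
        block = handle.read(read_size)
        if not block:  # reached EOF
            break
        pending += block
        # Everything but the last piece is complete; the last piece may
        # continue (or hold a partial delimiter) in the next block.
        *complete, pending = pending.split(delim)
        for piece in complete:
            yield piece
    # Flush whatever follows the final delimiter (possibly an empty string).
    yield pending
```

Used as `for chunk in split_stream(open('file.txt')): ...`, it yields the same pieces as `str.split` would on the whole text, while holding only the current unfinished field plus one block in memory. A single field far larger than `read_size` makes the repeated concatenation slow, so the list-based buffer in the class above may be the better fit for that case.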

I think the `\n`s in front of the D and E might not meet his requirements, take another look at his expected output. — r.ook, Oct 12 '18 at 02:33
I mean, the `\n` characters come *after* semicolons / are exactly where they are in the file. I went by the text in the question and spirit of the output instead of the exact "expected output" because there are a few obvious errors in there (sets of two pipes separated by only newline characters when there are no similar sequences of semicolons) — jedwards, Oct 12 '18 at 02:42
