5

I'm trying to parallelize a file filtering operation, where each filter is a big regex, so the whole thing takes time to run. The file itself is around 100GB. The single-process version looks like this:

def func(line):
    # simple function as an example
    for i in range(10**7):
        pass
    return len(line) % 2 == 0


with open('input.txt') as in_sr, open('output.txt', 'w') as out_sr:
    for line in in_sr:
        if func(line):
            out_sr.write(line)

I tried using multiprocessing's `imap`, but that gives `ValueError: I/O operation on closed file`. I think the iterator is being copied to each process, but not all processes have that handle open.

Is there a way to do this using multiprocessing, preferably making use of pools?

simonzack
  • Have you put `with open...` under `if __name__ == '__main__':`? Do you have to keep lines in order? – eph Dec 03 '15 at 06:52
  • @eph Yes the lines have to be in the same order as the input file. In my real code the `with` is somewhere in a function. – simonzack Dec 03 '15 at 06:54
  • What is your file and regexps like? Would it be easier to do with awk on command line or some other file processing tool? – DainDwarf Dec 03 '15 at 06:57
  • @DainDwarf It's a lot of short lines (each around 200 chars); the main thing the filter does is check against regexes, but it also does some other minor things. I think keeping it all in Python is a bit more maintainable in case it becomes more complex in the future. – simonzack Dec 03 '15 at 06:59
  • 1
    Before multiprocessing, did you identify what took time using profiling? Did you try using pypy? – DainDwarf Dec 03 '15 at 07:00
  • @DainDwarf I didn't profile, but another similar operation (decoding strings) runs much faster, so I'm quite confident the filter is where the problem comes from. PyPy would help, but other parts of the code use scipy, so it's a bit inconvenient to isolate this part. – simonzack Dec 03 '15 at 07:04
  • I know I'm suggesting many things to avoid the initial problem, but here is another possibility: use the `split` command to split your big file into multiple files, and then run your program on each of them. – DainDwarf Dec 03 '15 at 07:05
  • @DainDwarf Suggestions and ideas welcome :) I think that would work too, but 100GB does take up space on the hard drive, and there's also the copying overhead, which I think multiprocessing avoids. – simonzack Dec 03 '15 at 07:07
  • Unless the operations on each line are very complex, your program will be I/O bound, not CPU bound. Are you sure your current program is CPU-bound? – TigerhawkT3 Dec 03 '15 at 07:10
  • @TigerhawkT3 Yes I am quite sure due to the time difference between encoding and filtering using large regexes (at least 5x, I didn't wait for it to finish). – simonzack Dec 03 '15 at 07:11
  • Implement your algorithm to extract lines from a window in a file using https://docs.python.org/3/library/mmap.html. Be careful though: because the processing is line-based, you don't know where a line ends, so you will have to do slightly overlapped reading (a rough sketch of this window approach appears after these comments). Three more things: firstly, if you have problems with `imap()`, maybe fixing those would be appropriate. Secondly, you can also waste a bunch of performance using regexes. Thirdly, you can waste a bunch of time copying data, especially when it's a large amount. – Ulrich Eckhardt Dec 03 '15 at 07:27
  • Is there any other error message? Did the program run for a while before you got `ValueError: I/O operation on closed file.`? – Jon Dec 03 '15 at 08:25
  • Not sure whether your code is indented correctly or not: http://stackoverflow.com/questions/18952716/valueerror-i-o-operation-on-closed-file – Jon Dec 03 '15 at 08:31
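
A rough, untested sketch of the byte-window idea from the comments (not from the answers below), using plain seek/readline rather than mmap for simplicity: each worker opens the file itself, seeks to its own byte range, and skips the partial first line so that every line is handled by exactly one window; since the windows are fed to `imap` in file order, the output order is preserved. The window size and the filter body are placeholders.

import os
from multiprocessing import Pool

WINDOW = 64 * 1024 * 1024  # bytes per task; purely an illustrative value

def filter_window(window):
    start, end = window
    kept = []
    with open('input.txt', 'rb') as f:
        f.seek(start)
        if start != 0:
            f.readline()  # partial first line; the previous window owns it
        while f.tell() <= end:
            line = f.readline()
            if not line:
                break
            if len(line) % 2 == 0:  # stand-in for the real regex filter
                kept.append(line)
    return kept

if __name__ == '__main__':
    size = os.path.getsize('input.txt')
    windows = [(i, min(i + WINDOW, size)) for i in range(0, size, WINDOW)]
    pool = Pool(processes=4)
    with open('output.txt', 'wb') as out_sr:
        # imap yields each window's kept lines in file order
        for lines in pool.imap(filter_window, windows):
            out_sr.writelines(lines)
    pool.close()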

2 Answers

3

I can run the following code without error. Make sure you are not using `in_sr` and `out_sr` outside the `with` statement.

from multiprocessing import Pool

def func(line):
    # simple function as an example
    for i in range(10**7):
        pass
    return len(line) % 2 == 0, line

def main():
    with open('input.txt','r') as in_sr, open('output.txt', 'w') as out_sr:
        pool = Pool(processes=4)
        # imap preserves input order; chunksize batches lines into tasks
        for ret, line in pool.imap(func, in_sr, chunksize=4):
            if ret:
                out_sr.write(line)
        pool.close()

if __name__ == '__main__':
    main()
Jon
  • Strange, I just tried installing Python 3.5.0 and it does work there; I think my previous Python version (3.4.x) was buggy. Thanks for the answer, it definitely helped me diagnose the problem! – simonzack Dec 04 '15 at 03:55
  • Btw `contextlib.closing` can be used here as an alternative style. – simonzack Dec 04 '15 at 03:56
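
For reference, a minimal sketch of that `contextlib.closing` style, reusing the same toy `func` as in the answer above; `closing()` calls `pool.close()` automatically when the block exits:

from contextlib import closing
from multiprocessing import Pool

def func(line):
    # same toy filter as above
    return len(line) % 2 == 0, line

if __name__ == '__main__':
    with open('input.txt') as in_sr, open('output.txt', 'w') as out_sr, \
            closing(Pool(processes=4)) as pool:
        for ret, line in pool.imap(func, in_sr, chunksize=4):
            if ret:
                out_sr.write(line)
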
1

The code is similar to this:

def func(line):
    ...

if __name__ == '__main__':

    from multiprocessing import Pool
    from itertools import tee

    pool = Pool(processes=4)

    with open('input.txt') as in_sr, open('output.txt', 'w') as out_sr:
        lines1, lines2 = tee(in_sr)
        # imap yields results in input order, so zip pairs each line with its flag
        for line, flag in zip(lines1, pool.imap(func, lines2)):
            if flag:
                out_sr.write(line)
eph