I'm trying to parallelize a file filtering operation, where each filter is a big regex, so the whole thing takes time to run. The file itself is around 100 GB. The single-process version looks like this:
def func(line):
    # simple function as an example; the real filter is a big regex
    for i in range(10**7):
        pass
    return len(line) % 2 == 0

with open('input.txt') as in_sr, open('output.txt', 'w') as out_sr:
    for line in in_sr:
        if func(line):
            out_sr.write(line)
I tried using multiprocessing's imap, but that gives ValueError: I/O operation on closed file. I think the file iterator is being copied to each process, but not all of the processes have that handle open.
Is there a way to do this using multiprocessing, preferably making use of pools?
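For what it's worth, the shape I'm hoping for is something like the sketch below. It's untested against the real data, the chunksize value is a guess, and the worker wrapper that returns the line along with the verdict is just my idea for avoiding a second pass over the file, so I don't know if this is the right pattern:

from multiprocessing import Pool

def func(line):
    # simple stand-in for the real regex filter
    for i in range(10**7):
        pass
    return len(line) % 2 == 0

def worker(line):
    # return the verdict together with the line so the parent
    # can write matches without re-reading the input file
    return func(line), line

if __name__ == '__main__':
    with open('input.txt') as in_sr, open('output.txt', 'w') as out_sr:
        with Pool() as pool:
            # imap is lazy and yields results in input order;
            # chunksize is a guess to reduce per-line IPC overhead
            for keep, line in pool.imap(worker, in_sr, chunksize=100):
                if keep:
                    out_sr.write(line)

I picked imap over map because map would pull the whole 100 GB file into memory before dispatching anything, whereas imap streams it.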