
I’m trying to read a bunch of files in a folder, process their content, and save the results. Since I have a lot of files, I need to parallelize the operation.

Here is the code I tried, but when I run it nothing happens; I don’t even get an error. It just hangs. Note that if I call process_file() directly with a file name, it works.

from multiprocessing import Pool
from pathlib import Path
import torch

source_dir = Path('source/path')
target_dir = Path('target/path')

def process_file(file):
    with open(file, 'r') as f:
        result = ... # do stuff with f

    target = target_dir / file.name
    torch.save(result, target)

p = Pool(10)
p.map(process_file, source_dir.iterdir())

I was thinking that maybe it is because .iterdir() yields a generator, but I’m having the same problem with os.listdir(). What am I missing?

Thanks in advance.

Silver Duck
  • I can't replicate, so it would have to be something that is happening within the `process_file` func that you have excluded. – gold_cy Feb 04 '19 at 14:55
  • Can we please have a [mcve]? (You may well find it works when you write a trivial MCVE. In that case you need to expand it until you find the minimal example that fails, and then look at how it differs from the slightly smaller minimal example that succeeds. There is then a very good chance the problem will be obvious.) – Martin Bonner supports Monica Feb 04 '19 at 14:59
  • @Silver Duck, I posted an answer, did you read it? – Employee Feb 04 '19 at 15:29

1 Answer


Your function process_file needs a full path to the file in order to open it.

You can use the os module to join your current working directory with the folder of interest.

import os

full_paths = []
for el in source_dir.iterdir():
    full_paths.append(os.path.join(os.getcwd(), str(el)))

You can now call process_file correctly by iterating over the elements of the full_paths list.

This should do the job.
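
For example, here is a rough sketch of how the full script could look. Note that the Path(file) conversion inside process_file and the if __name__ == '__main__': guard are my additions, not part of your original code; the guard is required on platforms that spawn rather than fork worker processes, and is good practice in general:

import os
from multiprocessing import Pool
from pathlib import Path

import torch

source_dir = Path('source/path')
target_dir = Path('target/path')

def process_file(file):
    # full_paths holds plain strings, so convert back to Path for .name
    file = Path(file)
    with open(file, 'r') as f:
        result = ...  # do stuff with f

    target = target_dir / file.name
    torch.save(result, target)

if __name__ == '__main__':
    # build absolute paths before handing the work to the pool
    full_paths = []
    for el in source_dir.iterdir():
        full_paths.append(os.path.join(os.getcwd(), str(el)))

    with Pool(10) as p:
        p.map(process_file, full_paths)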

Employee