I am writing a function that converts a PDF into PNG images; it looks like this:
```python
import os
from wand.image import Image

def convert_pdf(filename, resolution):
    with Image(filename=filename, resolution=resolution) as img:
        pages_dir = os.path.join(os.path.dirname(filename), 'pages')
        page_filename = os.path.splitext(os.path.basename(filename))[0] + '.png'
        # exist_ok avoids a FileExistsError when several PDFs share a directory
        os.makedirs(pages_dir, exist_ok=True)
        img.save(filename=os.path.join(pages_dir, page_filename))
```
When I try to parallelize it with joblib, memory usage keeps growing and the processing of my PDF files never finishes:
```python
import glob
from joblib import Parallel, delayed

def convert(dataset, resolution):
    Parallel(n_jobs=-1, max_nbytes=None)(
        delayed(convert_pdf)(filename, resolution)
        for filename in glob.iglob(dataset + '/**/*.pdf', recursive=True)
    )
```
When I call the function serially, memory usage stays constant.
How does joblib manage memory allocation for each parallel worker? How can I modify my code so that memory usage stays constant when running in parallel?
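For reference, one workaround I am considering is to run each conversion in a short-lived worker process so that any memory the task leaks is released when the process exits. This is a minimal sketch using the standard library's `multiprocessing.Pool` with `maxtasksperchild=1` instead of joblib; `convert_task` is a stand-in for my `convert_pdf` call, and I have not verified that this keeps memory flat with Wand:

```python
import multiprocessing
import os

def convert_task(i):
    # Stand-in for convert_pdf(filename, resolution).
    # Returns the worker's pid so the recycling behaviour is observable.
    return os.getpid()

def run_pool(n_tasks=4):
    # maxtasksperchild=1 makes each worker exit after a single task,
    # so the pool replaces it with a fresh process (and fresh memory).
    # chunksize=1 ensures every item counts as its own task.
    with multiprocessing.Pool(processes=2, maxtasksperchild=1) as pool:
        return pool.map(convert_task, range(n_tasks), chunksize=1)
```

If this works, each PDF would be handled by a brand-new process, at the cost of extra process start-up overhead per file.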