
The joblib module provides a tremendously easy-to-use function, Parallel, to simplify coding. However, it always gathers all the results before you can access any of them.

I need to deal with the results one by one because they are big arrays that take a lot of memory. They cannot all reside in memory at the same time, so I have to process part of them first and then discard them. Originally, I used concurrent.futures.as_completed with ProcessPoolExecutor so that results could be handled as soon as they became available.
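Roughly, the pattern looked like this (square is just a stand-in for the real function that returns a large array):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def square(n):
    # Stand-in for a job that returns a big array in the real use case.
    return n * n

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(square, n) for n in range(4)]
        for fut in as_completed(futures):
            result = fut.result()   # handle this result now...
            print(result)           # ...then let it go out of scope to free memory
```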

But now I also want joblib to manage the memmapped arrays for me. Does joblib have an interface like ProcessPoolExecutor's? I looked into the code a little and found MemmapingPool, but there is no documentation or example of how to use it.

I have the following questions:

  1. Do I use it the same way as ProcessPoolExecutor?
  2. How do I handle Ctrl-C in this case?

1 Answer


After some research and reading the source code of joblib, I found a way to do this by manually managing the memmapped arrays. A code snippet has been posted as a gist.

The simplest way to use it is through the wrap function, which automatically detects memmap arguments and wraps them in SharedArray. The return value will also be wrapped in SharedArray if it is a memmap. Example:

import concurrent.futures
import numpy as np

# x is backed by a file on disk, so workers can reopen it instead of copying it
x = np.memmap('data', dtype=int, mode='w+', shape=100)
x[:] = np.random.randint(0, 100, 100)
with concurrent.futures.ProcessPoolExecutor(2) as pool:
    fut1 = pool.submit(*wrap(np.multiply, x[:50], 2))
    fut2 = pool.submit(*wrap(np.multiply, x[50:], -2))
    print(fut1.result())  # or fut1.result().asarray() in case the function returns a memmap
    print(fut2.result())
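The gist itself is not reproduced here, but a minimal sketch of the idea behind such a wrap helper might look as follows. All helper names are hypothetical, only whole-memmap arguments are handled (not slices, and not memmap return values, which is what the SharedArray wrapper covers):

```python
# Hypothetical sketch: wrap() replaces memmap arguments with small
# (filename, dtype, shape, offset) descriptors so the pickled task stays
# tiny; each worker reopens the file instead of receiving a data copy.
import numpy as np

def _describe(arg):
    # Send a lightweight descriptor instead of the array's contents.
    if isinstance(arg, np.memmap):
        return ('memmap', arg.filename, arg.dtype.str, arg.shape, arg.offset)
    return ('plain', arg)

def _restore(desc):
    # Reopen the memmap inside the worker ('r+' avoids truncating the file).
    if desc[0] == 'memmap':
        _, filename, dtype, shape, offset = desc
        return np.memmap(filename, dtype=dtype, mode='r+',
                         shape=shape, offset=offset)
    return desc[1]

def _call(func, *descs):
    # Runs in the worker: rebuild the real arguments, then call func.
    return func(*[_restore(d) for d in descs])

def wrap(func, *args):
    # Returns a tuple suitable for pool.submit(*wrap(func, a, b, ...)).
    return (_call, func) + tuple(_describe(a) for a in args)
```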