I'm struggling with this problem.

I have a big list of lists that I want to access from parallel code to perform CPU-intensive operations. To do that I'm trying to use multiprocessing.Pool; the problem is that I also need to share this massive list of lists with every child process.

As the 'list of lists' is not regular (e.g. [[1, 2], [1, 2, 3]]), I can't store it as an mp.Array, and, as previously said, since I'm not using mp.Process I haven't figured out a way to use mp.Manager for this task. It's important to me to keep the list-of-lists structure because I'm applying a function that queries it by indexes using itemgetter from the operator module.
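
For reference, itemgetter(*indexes) pulls several positions out of a sequence in a single call; a quick illustration with throwaway data:

from operator import itemgetter

rows = [[1, 2], [1, 2, 3], [4], [5, 6]]
print(itemgetter(1, 3)(rows))  # ([1, 2, 3], [5, 6])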

Here is a fictitious example of what I'm trying to achieve:

import multiprocessing as mp
from operator import itemgetter
import numpy as np

def foo(indexes):
    # here I must somehow guarantee read access to big_list_of_lists in every
    # child process; this code would work with a single child process using a
    # global variable, but would fail with larger data.
    store_tuples = itemgetter(*indexes)(big_list_of_lists)
    return np.mean([item for sublista in store_tuples for item in sublista])

def main():
    # big_list_of_lists is the variable that I want to share across my child processes
    big_list_of_lists = [[1, 3], [3, 1, 3], [1, 2], [2, 0]]

    ctx = mp.get_context('spawn')
    # each element of big_list_of_lists is also passed as the indexes argument
    pool = mp.Pool(ctx.Semaphore(mp.cpu_count()).get_value())
    res = list(pool.map(foo, big_list_of_lists))
    pool.close()
    pool.join()

    return res

if __name__ == '__main__':
    print(main())
# desired output is equivalent to:
# a = []
# for i in big_list_of_lists:
#     store_tuples = itemgetter(*i)(big_list_of_lists)
#     a.append(np.mean([item for sublista in store_tuples for item in sublista]))
# 'a' would be equal to [1.8, 1.5714285714285714, 2.0, 1.75]

Other details: the solution should preferably use Python 3.6 and must work on Windows.

Thank you very much!


1 Answer

It seems to work fine for me using mp.Manager, with an mp.Manager.list of mp.Manager.lists. I believe this will not copy the lists to every process.

The important line is:

big_list_of_lists_proxy = manager.list([manager.list(sublist) for sublist in big_list_of_lists])

Depending on your use case, you may instead want:

big_list_of_lists_proxy = manager.list(big_list_of_lists)

Whether every sublist should be a proxy depends on whether each sublist is large, and on whether it is read in its entirety. If a sublist is large, it is expensive to transfer the list object to each process that needs it (O(n) cost), so a proxy list from a manager should be used. However, if every element is going to be needed anyway, there is no advantage to using a proxy.
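
To make the difference concrete, here is a minimal sketch (with throwaway data): fetching an element of a plain list stored in a manager.list returns a copy, while fetching a nested proxy returns a handle whose operations are forwarded back to the manager process. Note that nested proxies require Python 3.6+.

import multiprocessing as mp

if __name__ == '__main__':
    with mp.Manager() as manager:
        plain = manager.list([[1, 3], [3, 1, 3]])  # sublists are ordinary lists
        nested = manager.list([manager.list(s) for s in [[1, 3], [3, 1, 3]]])

        sub = plain[0]          # the whole sublist is copied out of the manager
        sub.append(99)          # mutates only the local copy
        print(plain[0])         # [1, 3] -- the shared data is unchanged

        subp = nested[0]        # a ListProxy, not a copy
        subp.append(99)         # forwarded to the manager process
        print(list(nested[0]))  # [1, 3, 99] -- the shared data changed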

import multiprocessing as mp
from operator import itemgetter
import numpy as np
from functools import partial


def foo(indexes, big_list_of_lists):
    # here I must somehow guarantee read access to big_list_of_lists in every
    # child process; this code would work with a single child process using a
    # global variable, but would fail with larger data.
    store_tuples = itemgetter(*indexes)(big_list_of_lists)
    return np.mean([item for sublista in store_tuples for item in sublista])


def main():
    # big_list_of_lists is the variable that I want to share across my child processes
    big_list_of_lists = [[1, 3], [3, 1, 3], [1, 2], [2, 0]]
    ctx = mp.get_context('spawn')
    with ctx.Manager() as manager:
        big_list_of_lists_proxy = manager.list([manager.list(sublist) for sublist in big_list_of_lists])
        # each element of big_list_of_lists is also passed as the indexes argument
        pool = ctx.Pool(mp.cpu_count())
        res = list(pool.map(partial(foo, big_list_of_lists=big_list_of_lists_proxy), big_list_of_lists))
        pool.close()
        pool.join()

    return res


if __name__ == '__main__':
    print(main())
# desired output is equivalent to:
# a = []
# for i in big_list_of_lists:
#     store_tuples = itemgetter(*i)(big_list_of_lists)
#     a.append(np.mean([item for sublista in store_tuples for item in sublista]))
# 'a' would be equal to [1.8, 1.5714285714285714, 2.0, 1.75]
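
As a side note: since the data here is only read, another common pattern that works with spawn on Windows is to ship the list to each worker exactly once through the pool's initializer and keep it in a module-level global, so later lookups are plain local indexing with no manager round-trips. A minimal sketch of that approach (the init_worker name and the module-level _big_list_of_lists are just for illustration):

import multiprocessing as mp
from operator import itemgetter
import numpy as np

_big_list_of_lists = None  # set once per worker by the initializer


def init_worker(big_list_of_lists):
    # runs once in every child process when the pool starts, so the
    # list is pickled and sent to each worker a single time
    global _big_list_of_lists
    _big_list_of_lists = big_list_of_lists


def foo(indexes):
    store_tuples = itemgetter(*indexes)(_big_list_of_lists)
    return np.mean([item for sublista in store_tuples for item in sublista])


def main():
    big_list_of_lists = [[1, 3], [3, 1, 3], [1, 2], [2, 0]]
    ctx = mp.get_context('spawn')
    with ctx.Pool(mp.cpu_count(), initializer=init_worker,
                  initargs=(big_list_of_lists,)) as pool:
        return pool.map(foo, big_list_of_lists)


if __name__ == '__main__':
    print(main())  # [1.8, 1.5714285714285714, 2.0, 1.75]

The trade-off is memory: every worker holds its own full copy of the list, whereas the manager keeps a single copy and serves it over IPC.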
  • Hi @Oli, thank you for your answer! Apparently your solution is working; I'm running a few more tests to be 100% sure. But I wasn't able to understand the 'ace in the hole'. Why exactly are you creating a manager.list for each sublist inside a main manager.list? – Patrick Nasser Dec 23 '21 at 13:19
  • It's not strictly necessary that the sublists are `manager.list`s - in this specific example I don't think there's any advantage to it, since every sublist that is fetched from the list in `foo` is used in its entirety. My thinking was that since `big_list_of_lists` is probably big, the sublists may also be big, and so it would be advantageous to make them proxies so that they don't have to be serialized and sent to every process that needs to read from them. – Oli Dec 23 '21 at 14:37
  • Thank you, that makes sense for my problem – Patrick Nasser Dec 23 '21 at 14:54