How to share a customized variable when using multiprocessing in Python?

Question

I have a Bloom filter object created by pybloom, a Python module. Assume I have over 10 million strings waiting to be added to this object. The usual way to do so is:

from pybloom import BloomFilter

# initialize a bloomfilter object
bf = BloomFilter(int(2e7)) 

for i in string_list:
    bf.add(i)

But this takes too much time, especially when string_list is really long. My computer (Windows 7) has a 4-core CPU, so I want to know whether there is a multi-process way to make full use of the CPU and speed up the add step.

I know a little about multiprocessing, but I cannot solve the problem of exchanging custom objects, such as bf above, between processes.

Forgive my poor English and show me the code if you can. Thanks.

Maybe try to use Queue in Python, which is designed for multi-processing. — Menglong Li, Jan 02 '19 at 08:43
This would be helpful: https://stackoverflow.com/questions/21968278/multiprocessing-share-unserializable-objects-between-processes — Xiwei Wang, Jan 02 '19 at 09:47

score 0 · Accepted Answer · answered Jan 02 '19 at 09:51

I'm not really familiar with pybloom or BloomFilter objects, but a quick look at the code reveals that you can union multiple BloomFilter objects.

Based on the size of your string_list, you can create a Pool of n processes. For simplicity, say n=2. The logic is: for, say, x strings in string_list, divide it into 2 lists of size x/2 each, then create a separate process to handle each one.

You can have something like this:

from multiprocessing import Pool

if __name__ == '__main__':  # guard is required on Windows, where workers re-import this module
    n = 2  # number of worker processes
    with Pool(n) as p:
        bloom_filter_parts = p.map(add_str_to_bloomfilter, divide_list_in_parts(string_list))
    # Now you have a list of BloomFilter objects with parts of string_list in them, concatenate them
    res_bloom_filter = concat_bf_list(bloom_filter_parts)

Code for add_str_to_bloomfilter:

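from pybloom import BloomFilter

def add_str_to_bloomfilter(str_list_slice):
    # Every partial filter needs the same capacity (and error rate), otherwise
    # they cannot be unioned later; size it for the whole string_list.
    res_bf = BloomFilter(capacity=int(2e7))
    for i in str_list_slice:
        res_bf.add(i)
    return res_bf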

You have to add code for divide_list_in_parts and concat_bf_list. But I hope you get the logic.
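
For what it's worth, the two missing helpers could look roughly like this. It is only a sketch: it assumes each partial filter exposes a union(other) method that returns the merged filter, which is what pybloom's BloomFilter seems to offer as long as all filters share the same capacity and error rate. The n_parts parameter is my own addition for illustration.

```python
from functools import reduce


def divide_list_in_parts(items, n_parts=2):
    """Split items into n_parts contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # the first `extra` slices take one additional element each
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def concat_bf_list(bf_list):
    """Fold the partial filters into one by calling union() pairwise."""
    # Assumption: every object in bf_list has a union(other) method.
    return reduce(lambda a, b: a.union(b), bf_list)
```

If you use this, pass the same number to n_parts as you pass to Pool, so that each worker gets exactly one slice.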

Also, read this: https://docs.python.org/3.4/library/multiprocessing.html

I solved the problem with your logic. It does not quite match my original idea, but it works. Thank you again for the answer. — CoffeeSun, Jan 03 '19 at 02:24
