
I have a Bloom filter object created with pybloom, a Python module. Suppose I have more than 10 million strings waiting to be added to it. The usual way to do that is:

from pybloom import BloomFilter

# initialize a bloomfilter object
bf = BloomFilter(int(2e7)) 

for i in string_list:
    bf.add(i)

But this takes too long, especially when string_list is really long. My computer (Windows 7) has a 4-core CPU, so I would like to know whether there is a multi-process way to make full use of the CPU and speed up the add calls.

I know a little about multiprocessing, but I cannot work out how to exchange custom objects, such as bf above, between processes.

Forgive my poor English, and please show me the code if you can. Thanks.

CoffeeSun

1 Answer


I'm not really familiar with pybloom or BloomFilter objects, but a quick look at the code reveals that you can union multiple BloomFilter objects.

Depending on the size of your string_list, you can create a Pool of n worker processes. For simplicity, say n=2. The logic is: for x strings in string_list, split the list into 2 lists of x/2 strings each, then have a separate process handle each one.

You can have something like this:

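from multiprocessing import Pool

if __name__ == '__main__':  # needed on Windows, where worker processes re-import this module
    n = 2  # number of worker processes
    with Pool(n) as p:
        # Each worker builds a BloomFilter from one slice of string_list
        bloom_filter_parts = p.map(add_str_to_bloomfilter, divide_list_in_parts(string_list))
    # Now you have a list of BloomFilter objects, each holding part of string_list; merge them
    res_bloom_filter = concat_bf_list(bloom_filter_parts)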

Code for add_str_to_bloomfilter:

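from pybloom import BloomFilter

def add_str_to_bloomfilter(str_list_slice):
    # Every worker uses the full capacity from the question: pybloom only unions
    # filters with matching capacity and error rate, and the result must fit all strings.
    res_bf = BloomFilter(capacity=int(2e7))
    for i in str_list_slice:
        res_bf.add(i)
    return res_bf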

You have to write divide_list_in_parts and concat_bf_list yourself, but I hope you get the logic.
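One possible sketch of those two helpers, in case it helps. The n_parts parameter and its default of 4 (one slice per core on the asker's machine) are my own assumptions, not anything pybloom prescribes. concat_bf_list only assumes that each object has a union() method that returns the merged object, which is what the pybloom source appears to provide for filters with the same capacity and error rate.

```python
from functools import reduce


def divide_list_in_parts(items, n_parts=4):
    """Split items into n_parts contiguous slices of nearly equal length.

    n_parts is a hypothetical parameter; 4 is just a guess matching a 4-core CPU.
    """
    size, extra = divmod(len(items), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `extra` slices take one additional element each.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def concat_bf_list(bf_list):
    """Merge a non-empty list of filters pairwise through their union() method."""
    return reduce(lambda left, right: left.union(right), bf_list)
```

Keep in mind that p.map sends every worker's filter back to the parent process by pickling it, so with a large capacity each extra part costs additional memory and transfer time. It is worth timing this against the plain loop before relying on it.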

Also, read this: https://docs.python.org/3.4/library/multiprocessing.html

DM_Morpheus
  • I solved the problem with your logic. It is not quite what I originally had in mind, but it works. Thank you again for the answer. – CoffeeSun Jan 03 '19 at 02:24