To clarify first of all, I'm NOT asking why map in multiprocessing is slow.
I had code working just fine using pool.map(). But, in developing it (and to make it more generic), I needed to use pool.starmap() to pass 2 arguments instead of one.
I'm still fairly new to Python and multiprocessing, so I'm not sure if I'm doing something obviously wrong here. I also couldn't find anything on this that's been previously asked, so apologies if this has already been answered.
Using Python 3.10, by the way.
I'm processing just under 5 million items in a list, and managed to get a result in just over 12 minutes (instead of a predicted 6 1/2 days if it was run iteratively!) when using pool.map().
I'm essentially obtaining the intersection of list_A and list_B, but I need to preserve the frequencies of each item, and so have to do it in O(n*m).
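To illustrate with a toy example why I can't just use sets: a plain set intersection collapses repeated items, which is exactly what I need to avoid:

```python
list_A = ["x", "x", "y", "z"]
list_B = ["x", "y"]

# set intersection loses the duplicate "x"
print(sorted(set(list_A) & set(list_B)))  # ['x', 'y']

# membership filtering keeps every matching occurrence in list_A
print([item for item in list_A if item in list_B])  # ['x', 'x', 'y']
```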
But now, I'm getting significantly longer times when using pool.starmap().
I can't seem to work out why, and if anyone can give me some indication it would be greatly appreciated!
Here is the code for pool.map() that works quickly, as expected (where list_B is actually part of the listCompare function):
def listCompare(list_A):
    toReturn = []
    for item in list_A:
        if item in list_B:
            toReturn.append(item)
    return toReturn

out = []
chnks = chunks(list_a, multiprocessing.cpu_count())
with multiprocessing.Pool() as pool:
    for result in pool.map(listCompare, chnks):
        out.extend(result)
print("Parallel:", out)
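(chunks isn't shown above; it's roughly this kind of generator, yielding successive slices of the list. The exact slice length handling here is just for illustration and may differ from my actual helper:)

```python
def chunks(lst, n):
    # yield successive n-item slices of lst (the last one may be shorter)
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

print(list(chunks([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```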
Here is the code for pool.starmap() that works slowly. listCompare is modified to take 2 arguments here. (I can't use my chunks method here, as I can't pass the yield into the tuple, so I've set the chunksize differently. Is this the reason for the slowdown?)
def listCompare(list_A, list_B):
    toReturn = []
    for item in list_A:
        if item in list_B:
            toReturn.append(item)
    return toReturn

with multiprocessing.Pool() as pool:
    for resultA in pool.starmap(listCompare, [(list_a1, list_b1)], chunksize=multiprocessing.cpu_count()):
        output_list1.extend(resultA)
    for resultB in pool.starmap(listCompare, [(list_a2, list_b2)], chunksize=multiprocessing.cpu_count()):
        output_list2.extend(resultB)
    for resultC in pool.starmap(listCompare, [(list_a3, list_b3)], chunksize=multiprocessing.cpu_count()):
        output_list3.extend(resultC)
    for resultD in pool.starmap(listCompare, [(list_a4, list_b4)], chunksize=multiprocessing.cpu_count()):
        output_list4.extend(resultD)
Thanks in advance, and apologies if I've missed out anything that may help in answering!
As I said earlier, I know this can be done with intersection, but I need the frequencies of each occurrence, so I need to preserve duplicates.
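Since every matching occurrence in list_A is kept (only membership of list_B matters), I believe the inner in test could also be made constant-time by building a set from list_B once, assuming the items are hashable; duplicates in list_A still survive. A sketch of that variant:

```python
def listCompare(list_A, list_B):
    # one-off set build; each membership test is then O(1) instead of O(m)
    lookup = set(list_B)
    # duplicates in list_A are preserved, unlike a set intersection
    return [item for item in list_A if item in lookup]

print(listCompare([1, 1, 2, 3], [1, 2]))  # [1, 1, 2]
```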