
I have a large list containing strings. I wish to create a dict from this list such that:

list = [str1, str2, str3, ....]

dict = {str1:len(str1), str2:len(str2), str3:len(str3),.....}

My go-to solution was a for loop, but it's taking too much time (my list contains almost 1M elements):

d = {}
for i in list:
    d[i] = len(i)

I wish to use the multiprocessing module in Python to leverage all cores and reduce the time the process takes to execute. I have come across some crude examples involving the Manager class to share a dict between different processes, but I am unable to implement them. Any help would be appreciated!
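
For context, the pattern I've seen looks roughly like this (a minimal sketch; fill and the two-way split are my own illustrative names, and every write to the proxy dict is a separate inter-process call, which seems like it would be slow):

from multiprocessing import Manager, Process

def fill(shared, chunk):
    # each assignment is forwarded to the manager process
    for s in chunk:
        shared[s] = len(s)

if __name__ == '__main__':
    words = ['alpha', 'beta', 'gamma', 'delta']
    with Manager() as mgr:
        shared = mgr.dict()
        procs = [Process(target=fill, args=(shared, words[i::2])) for i in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        d = dict(shared)  # copy the proxy back to a plain dict
    print(d)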


1 Answer


I don't know if using multiple processes will be faster, but it's an interesting experiment.

General flow:

  • Create a list of random words
  • Split the list into segments, one segment per process
  • Run the processes, passing each segment as a parameter
  • Merge the result dictionaries into a single dictionary

Try this code:

import concurrent.futures
import random
from multiprocessing import freeze_support

def todict(lst):
    print(f'Processing {len(lst)} words')
    return {e: len(e) for e in lst}  # convert list segment to dictionary

if __name__ == '__main__':
    freeze_support()  # no-op unless the script is frozen into a Windows executable

    # create random word list - max 15 chars
    letters = [chr(x) for x in range(65, 65 + 26)]  # A-Z
    words = [''.join(random.sample(letters, random.randint(1, 15)))
             for w in range(10000)]  # 10000 words

    words = list(set(words))  # remove dups, count will drop

    print(len(words))

    ########################

    cpucnt = 4  # process count to use

    # split word list, one segment per process
    wl = len(words) // cpucnt + 1  # word count per process
    lstsplit = []
    for c in range(cpucnt):
        lstsplit.append(words[c * wl:(c + 1) * wl])  # word list for each process

    # start processes
    with concurrent.futures.ProcessPoolExecutor(max_workers=cpucnt) as executor:
        procs = [executor.submit(todict, lst) for lst in lstsplit]
        results = [p.result() for p in procs]  # block until results are gathered

    # merge results into a single dictionary
    dd = {}
    for r in results:
        dd.update(r)

    print(len(dd))  # confirm it matches the word count
    with open('dd.txt', 'w') as f:
        f.write(str(dd))  # write dictionary to a text file
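
Whether this actually beats one process is worth measuring: the per-item work (len) is so cheap that pickling the segments to the workers and the result dictionaries back can cost more than the loop it replaces. A single-process baseline to compare against (a rough sketch; timings vary by machine):

import random
import string
import time

# build a deduplicated word list at roughly the question's scale
words = list({''.join(random.sample(string.ascii_uppercase, random.randint(1, 15)))
              for _ in range(1_000_000)})

t0 = time.perf_counter()
d = {w: len(w) for w in words}  # plain single-process comprehension
print(f'serial: {time.perf_counter() - t0:.3f}s for {len(d):,} entries')

On typical hardware the serial comprehension finishes a million strings in a fraction of a second, which leaves the process pool little room to win.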