
I am new to Python and to multiprocessing in Python (this is my first Python project).

I have written a few modules and wired them together to run sequentially. Now I have a requirement to speed a few things up.

What I want to achieve is:

module-one.py
    Read a JSON file and store it as a dict (a normal dict or a multiprocessing.Manager().dict())
    module-two.method()
    
module-two.py
    -- Some methods for business logic --
    multiprocessing.Process(target=module-three.method)
    
module-three.py
    def method():
        multiprocessing.Process(target=module-four.method)
        
module-four.py
    def method():
        I should access the dict that was created in module-one
        The global dict that multiple processes can access
        --- More business logic and data transformations ---
        
Note:
    I am constrained not to use any frameworks like Flask. Otherwise, I could have tried Flask's `g` to store things globally.
    I am constrained not to use any external caching mechanisms like Memcached or Redis.

To lessen the overhead, I tried combining modules three and four into one. That did not help either; the dict in module-three/module-four is always empty.

My questions are:

  1. Is it possible to achieve what I have posted above?
  2. If it is not possible, what are the alternative ways to handle my requirements?

I browsed Stack Overflow and other forums extensively. I found many single-module examples where the dict is created at module level or inside a class and the same dict is passed as an argument to the spawned processes. Based on those examples, it looks like I should pass the dict from module-one to module-two and so on up to module-four, roughly as in the sketch below. I felt there might be a better approach than passing the dict from one module to another, hence this question.
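
For reference, this is roughly what that pass-it-down approach would look like (a simplified sketch with hypothetical file and function names, not my actual code):

# module_one.py (hypothetical, simplified)
import json
from multiprocessing import Manager

import module_two

def main():
    manager = Manager()
    shared = manager.dict()          # managed dict, usable from other processes via its proxy
    with open('input.json') as f:    # hypothetical file name
        shared.update(json.load(f))
    module_two.method(shared)        # the dict has to be passed along explicitly

# module_two.py
from multiprocessing import Process

import module_three

def method(shared):
    # ... business logic ...
    p = Process(target=module_three.method, args=(shared,))  # ...and passed again here
    p.start()
    p.join()

# module_three.py and module_four.py would keep forwarding `shared` the same way,
# and module_four.method(shared) finally reads from and writes to it.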

Thanks, a newbie Python coder

Strive
  • It would help a lot if you share the code, it's not quite clear to me what you wanna do. – constt May 01 '22 at 05:41
  • have you tried the `multiprocessing.shared_memory` module? – Alexander May 01 '22 at 11:15
  • The easiest way to share a dictionary across processes is with a *managed* dictionary, e.g. `multiprocessing.Manager().dict()`. But be aware of the overhead of each access and some of its peculiarities. – Booboo May 01 '22 at 11:51
  • @constt I do not have the code on the machine from which I am posting this question, as all development happens on a secure network and that machine has no internet access. I will try to write something simple and post it. – Strive May 02 '22 at 04:32
  • @alexpdev No, I have not tried shared_memory. I will check that. – Strive May 02 '22 at 04:33
  • @Booboo I read about that and saw many examples. I should pass the created `multiprocessing.Manager().dict()` as an argument to the spawned processes. Is there an option where I can access it without passing it as an argument? The question may sound dumb; please note that I am trying to learn Python and get a hang of how things work. – Strive May 02 '22 at 04:35
  • @Strive The response is too long for a comment, so see my answer below. – Booboo May 02 '22 at 10:51

1 Answer


This is how you can avoid having to pass the managed dictionary explicitly to your worker function, which can now instead access it as a global variable:

Both multiprocessing.pool.Pool and concurrent.futures.ProcessPoolExecutor accept initializer and initargs arguments: a function, and the arguments to pass to it, that is called once in each pool process so that it can initialize global variables for that process. For example, using multiprocessing.pool.Pool, the following code will work on both Windows and Linux:

from multiprocessing import Pool, Manager

def init_pool_processes(d):
    # Initialize global variable managed_dict for each pool process:
    global managed_dict
    managed_dict = d

def worker(i):
    managed_dict[i] = i ** 2

def main():
    manager = Manager()
    managed_dict = manager.dict()
    pool = Pool(initializer=init_pool_processes, initargs=(managed_dict,))
    pool.map(worker, range(10))
    pool.close()
    pool.join()
    print(managed_dict)

if __name__ == '__main__': # Required for Windows
    main()

Prints:

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}
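
The same pattern works with concurrent.futures.ProcessPoolExecutor, which takes the same initializer and initargs arguments (a minimal sketch, reusing the names from the example above):

from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager

def init_pool_processes(d):
    # Initialize global variable managed_dict for each pool process:
    global managed_dict
    managed_dict = d

def worker(i):
    managed_dict[i] = i ** 2

def main():
    manager = Manager()
    managed_dict = manager.dict()
    with ProcessPoolExecutor(initializer=init_pool_processes,
                             initargs=(managed_dict,)) as executor:
        # Consume the iterator so any worker exceptions are raised here;
        # the with-block waits for all tasks to finish.
        list(executor.map(worker, range(10)))
    print(managed_dict)

if __name__ == '__main__': # Required for Windows
    main()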

If you are running on a platform such as Linux that uses the fork method to create new processes, then those new processes automatically inherit the main process's global variables as read-only copies; as soon as a subprocess attempts to modify such a variable, it gets its own copy of it. In this case, however, the global variable in question is a reference to a managed dictionary (actually, a reference to a proxy for the actual dictionary), and the subprocesses never modify that reference, only what it refers to, i.e. the dictionary itself. So the following code could be used instead:

from multiprocessing import Pool, Manager

def worker(i):
    managed_dict[i] = i ** 2

def main():
    global managed_dict
    manager = Manager()
    managed_dict = manager.dict()
    # the processes created will inherit global managed_dict:
    pool = Pool()
    pool.map(worker, range(10))
    pool.close()
    pool.join()
    print(managed_dict)

main()

Prints:

{0: 0, 8: 64, 9: 81, 1: 1, 2: 4, 3: 9, 5: 25, 6: 36, 7: 49, 4: 16}

This is why when you post a question tagged with multiprocessing, you are supposed to also tag the question with the platform you are running under; the answer very much depends on the platform.
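
If you are unsure which start method your platform uses by default, you can check it directly (a quick sketch):

import multiprocessing

if __name__ == '__main__':
    # Typically 'fork' on Linux and 'spawn' on Windows (and on macOS since Python 3.8)
    print(multiprocessing.get_start_method())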

Booboo