ThreadPoolExecutor, ProcessPoolExecutor and global variables

Question

I am new to parallelization in general and concurrent.futures in particular. I want to benchmark my script and compare the differences between using threads and processes, but I found that I couldn't even get that running because when using ProcessPoolExecutor I cannot use my global variables.

The following code will output Hello, as I expect, but when you change ThreadPoolExecutor to ProcessPoolExecutor, it will output None.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None

def process():
    print(greeting)

    return None


def main():
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(process)

    return None


def init():
    global greeting
    greeting = 'Hello'

    return None

if __name__ == '__main__':
    init()
    main()

I don't understand why this is the case. In my real program, init is used to set the global variables to CLI arguments, and there are a lot of them. Hence, passing them as arguments does not seem recommended. So how do I pass those global variables to each process/thread correctly?

I know that I can change things around, which will work, but I don't understand why. E.g. the following works for both Executors, but it also means that the globals initialisation has to happen for every instance.

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None

def init():
    global greeting
    greeting = 'Hello'

    return None


def main():
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(process)

    return None

def process():
    init()
    print(greeting)

    return None

if __name__ == '__main__':
    main()

So my main question is, what is actually happening. Why does this code work with threads and not with processes? And, how do I correctly pass set globals to each process/thread without having to re-initialise them for every instance?

(Side note: because I have read that concurrent.futures might behave differently on Windows, I have to note that I am running Python 3.6 on Windows 10 64 bit.)

jedwards · Accepted Answer · 2018-06-15T09:55:41.623

I'm not sure of the limitations of this approach, but you can pass objects from your main process/thread to the worker as arguments to submit (for ProcessPoolExecutor they have to be picklable). This would also help you get rid of the reliance on global vars:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def process(opts):
    opts["process"] = "got here"
    print("In process():", opts)

    return None


def main(opts):
    opts["main"] = "got here"
    executor_cls = [ProcessPoolExecutor, ThreadPoolExecutor][1]  # 0: processes, 1: threads
    with executor_cls(max_workers=1) as executor:
        executor.submit(process, opts)

    return None


def init(opts):                         # Gather CLI opts and populate dict
    opts["init"] = "got here"

    return None


if __name__ == '__main__':
    cli_opts = {"__main__": "got here"} # Initialize dict
    init(cli_opts)                      # Populate dict
    main(cli_opts)                      # Use dict

Works with both executor types.

Edit: Even though it sounds like it won't be a problem for your use case, I'll point out that with ProcessPoolExecutor, the opts dict you get inside process will be a separate copy (it is pickled and sent to the worker process), so mutations to it will not be visible across processes, nor will they be visible once you return to the __main__ block. ThreadPoolExecutor, on the other hand, will share the same dict object between threads.
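
The copy-versus-share difference can be checked with a small sketch. The helper names mark and worker_mutation_visible are made up for this illustration; only the two executor classes come from concurrent.futures.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def mark(opts):
    # Mutate the dict the worker received and hand it back.
    opts["worker"] = "got here"
    return opts


def worker_mutation_visible(executor_cls):
    """Return True if the caller's dict was changed by the worker."""
    opts = {"main": "got here"}
    with executor_cls(max_workers=1) as executor:
        executor.submit(mark, opts).result()
    return "worker" in opts


if __name__ == "__main__":
    # Threads share the dict object; processes work on a pickled copy.
    print("threads:  ", worker_mutation_visible(ThreadPoolExecutor))
    print("processes:", worker_mutation_visible(ProcessPoolExecutor))
```

If the worker needs to send changes back from a separate process, returning the value and reading it with future.result() (as mark does here) is the simplest route.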

can you have a look at this problem? https://stackoverflow.com/questions/59040311/update-variable-while-working-with-processpoolexecutor?noredirect=1#comment104324638_59040311 — johnrao07, Nov 26 '19 at 14:44

score 2 · Answer 2 · answered Oct 27 '20 at 01:34

Actually, the OP's first snippet will work as intended on Linux (tested with Python 3.6-3.8) because

On Unix a child process can make use of a shared resource created in a parent process using a global resource.

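as explained in the multiprocessing docs. However, it won't work on my Mac running Mojave (which is supposed to be a UNIX-compliant OS; tested only with Python 3.8). The most likely reason is that Python 3.8 changed the default start method on macOS from fork to spawn, so the child re-imports the module instead of inheriting the parent's memory. It certainly won't work on Windows, which has no fork, and relying on inherited globals is in general not a recommended practice with multiple processes.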
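
To see how the start method affects the result, here is a rough sketch that runs the same worker under every start method the platform offers. It assumes Python 3.7 or later, where ProcessPoolExecutor accepts mp_context; read_greeting and run are names invented for the example.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

greeting = None


def read_greeting():
    # Returns whatever this process sees in the module-level global.
    return greeting


def run(start_method):
    # Build a pool whose workers are created with the given start method.
    ctx = multiprocessing.get_context(start_method)
    with ProcessPoolExecutor(max_workers=1, mp_context=ctx) as executor:
        return executor.submit(read_greeting).result()


if __name__ == "__main__":
    greeting = "Hello"  # set in the parent only, after import
    for method in multiprocessing.get_all_start_methods():
        print(method, "->", run(method))
```

With fork the child inherits the parent's memory and reports Hello. Start methods that re-import the module (spawn, forkserver) skip the __main__ block, so the child reports None.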

score 0 · Answer 3 · answered Jun 15 '18 at 09:38

Let's imagine a process is a box, while a thread is a worker inside a box. A worker can only access the resources in its own box and cannot touch the resources in other boxes.

So when you use threads, you are creating multiple workers for your current box (the main process). But when you use processes, you are creating another box. In this case, the global variables initialised in this box are completely separate from the ones in the other box. That's why it doesn't work as you expect.

The solution given by jedwards is good enough for most situations. You can explicitly package the resources in the current box (serialize the variables) and deliver them to another box (transport them to another process) so that the workers in that box have access to the resources.
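
One way to do that packaging once per worker instead of once per task is the initializer hook. This is only a sketch: it assumes Python 3.7 or later, where both executors accept initializer and initargs (so it does not apply to the Python 3.6 mentioned in the question), and init_worker, process and run are illustrative names.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

greeting = None  # module-level global, filled in by init_worker


def init_worker(value):
    # Runs once in each worker when it starts, not once per task.
    global greeting
    greeting = value


def process(_):
    # Reads the global that init_worker set for this worker.
    return greeting


def run(value, tasks=3, executor_cls=ProcessPoolExecutor):
    with executor_cls(max_workers=2, initializer=init_worker,
                      initargs=(value,)) as executor:
        return list(executor.map(process, range(tasks)))


if __name__ == "__main__":
    print(run("Hello"))
```

The values in initargs are still pickled for a process pool, but only once per worker, which helps when the same settings are needed by many tasks.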

How about the situations when this solution is not "good enough"? E.g. situations where the objects are too big to serialize without a serious performance penalty? Is the only remaining solution to use global variables then? — Jivan, Jun 18 '20 at 23:51
@Jivan Actually, mostly we don't have a choice. On one computer, we can use shared memory. But mostly we are on different machines. We have to pay for that cost. — Sraw, Jun 19 '20 at 05:02

score 0 · Answer 4 · answered Jun 15 '18 at 09:41

A process represents activity that is run in a separate process in the OS meaning of the term while threads all run in your main process. Every process has its own unique namespace.

Your main process sets the value of greeting by calling init() inside your __name__ == '__main__' condition, and only for its own namespace. In your new process, this does not happen (on Windows the child re-imports the module, and __name__ is '__mp_main__' there), hence greeting remains None, and init() is never actually called unless you do so explicitly in the function your process executes.

While sharing state between processes is generally not recommended, there are ways to do so, as outlined in @jedwards' answer.

You might also want to check Sharing State Between Processes from the docs.
