
I am using multiprocessing in Python to parallelize some compute-heavy functions. But I found that there is a delay in process creation when passing a fat argument (e.g., a 1000-node networkx graph or a 1,000,000-item list). I experimented with two multiprocessing modules, "multiprocessing" and "pathos", and got similar results. My question is how to avoid this kind of delay, because it ruins the benefit brought by parallel computing.

In my sample code, I just pass a fat argument to the function being run in parallel; the function body does not touch the argument at all.

  1. The sample code using "multiprocessing":
import multiprocessing
import time

def f(args):
    (x, conn, t0, graph) = args
    ans = 1
    x0 = x
    t = time.time() - t0
    conn.send('factorial of %d: start@%.2fs' % (x0, t))
    while x > 1:
        ans *= x
        time.sleep(0.5)
        x -= 1
    t = time.time() - t0
    conn.send('factorial of %d: finish@%.2fs, res = %d' % (x0, t, ans))
    return ans

def main():
    var = (4, 8, 12, 20, 16)
    p = multiprocessing.Pool(processes=4)
    p_conn, c_conn = multiprocessing.Pipe()
    params = []
    t0 = time.time()
    N = 1000
    import networkx as nx
    G = nx.complete_graph(N, nx.DiGraph())

    import random
    for (start, end) in G.edges:
        G.edges[start, end]['weight'] = random.random()

    for i in var:
        params.append((i, c_conn, t0, G))
    res = list(p.imap(f, params))
    p.close()
    p.join()

    print('output:')
    while p_conn.poll():
        print(p_conn.recv())
    t = time.time() - t0
    print('factorial of %s@%.2fs: %s' % (var, t, res))

if __name__ == '__main__':
    main()

The output of the above sample code:

output:
factorial of 4: start@29.78s
factorial of 4: finish@31.29s, res = 24
factorial of 8: start@53.56s
factorial of 8: finish@57.07s, res = 40320
factorial of 12: start@77.25s
factorial of 12: finish@82.75s, res = 479001600
factorial of 20: start@100.39s
factorial of 20: finish@109.91s, res = 2432902008176640000
factorial of 16: start@123.55s
factorial of 16: finish@131.05s, res = 20922789888000
factorial of (4, 8, 12, 20, 16)@131.06s: [24, 40320, 479001600, 2432902008176640000, 20922789888000]

Process finished with exit code 0

According to the above output, there is a delay of around 24 seconds between the creation of consecutive processes.

  2. The sample code using "pathos":
import pathos
import multiprocess
import time

def f(x, conn, t0, graph):
    ans = 1
    x0 = x
    t = time.time() - t0
    conn.send('factorial of %d: start@%.2fs' % (x0, t))
    while x > 1:
        ans *= x
        time.sleep(0.5)
        x -= 1
    t = time.time() - t0
    conn.send('factorial of %d: finish@%.2fs, res = %d' % (x0, t, ans))
    return ans

def main():
    var = (4, 8, 12, 20, 16)
    p = pathos.multiprocessing.ProcessPool(nodes=4)
    p_conn, c_conn = multiprocess.Pipe()
    t0 = time.time()
    conn_s = [c_conn] * len(var)
    t0_s = [t0] * len(var)
    N = 1000
    import networkx as nx
    G = nx.complete_graph(N, nx.DiGraph())

    import random
    for (start, end) in G.edges:
        G.edges[start, end]['weight'] = random.random()

    res = list(p.imap(f, var, conn_s, t0_s, [G] * len(var)))

    print('output:')
    while p_conn.poll():
        print(p_conn.recv())
    t = time.time() - t0
    print('factorial of %s@%.2fs: %s' % (var, t, res))

if __name__ == '__main__':
    main()

The output of the above sample code:

output:
factorial of 4: start@29.63s
factorial of 4: finish@31.13s, res = 24
factorial of 8: start@53.50s
factorial of 8: finish@57.00s, res = 40320
factorial of 12: start@76.94s
factorial of 12: finish@82.44s, res = 479001600
factorial of 20: start@100.72s
factorial of 20: finish@110.23s, res = 2432902008176640000
factorial of 16: start@123.69s
factorial of 16: finish@131.20s, res = 20922789888000
factorial of (4, 8, 12, 20, 16)@131.20s: [24, 40320, 479001600, 2432902008176640000, 20922789888000]

Process finished with exit code 0

Similarly, according to the above output, there is a delay of around 24 seconds between the creation of consecutive processes.

If I reduce the graph size (a smaller node count), the delay decreases accordingly. My guess is that the extra time is spent pickling/dilling the networkx graph as an argument. Ideally, the first 4 processes should be created at the same time. How can I avoid this cost? Thank you!
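
To check this guess directly, one can time pickling the graph outside of any multiprocessing. Below is a minimal sketch that reuses the graph construction from the sample code (note it times the standard pickle module, while pathos serializes with dill):

import pickle
import random
import time
import networkx as nx

# build the same 1000-node complete digraph with random edge weights
G = nx.complete_graph(1000, nx.DiGraph())
for (start, end) in G.edges:
    G.edges[start, end]['weight'] = random.random()

t0 = time.time()
data = pickle.dumps(G)
print('pickle.dumps took %.2fs (%d bytes)' % (time.time() - t0, len(data)))

t0 = time.time()
pickle.loads(data)
print('pickle.loads took %.2fs' % (time.time() - t0))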


UPDATE

Thanks to Alexander's kind answer, I removed the pipe in both the "multiprocessing" and the "pathos" code. The "multiprocessing" code now performs like Alexander's (the delay drops to about 1 second), but the "pathos" code still has more than 20 seconds of delay. The revised "pathos" code is posted below:

import pathos
import multiprocess
import time
from pympler import asizeof
import sys



def f(args):
    (x, graph) = args
    t = time.ctime()
    print('factorial of %d: start@%s' % (x, t))
    time.sleep(4)
    return x


def main():
    t0 = time.time()
    params = []

    var = (4, 8, 12, 20, 16)
    p = pathos.multiprocessing.ProcessPool(nodes=4)
    N = 1000
    import networkx as nx
    G = nx.complete_graph(N, nx.DiGraph())

    import random
    for (start, end) in G.edges:
        G.edges[start, end]['weight'] = random.random()

    print('Size of G by sys', sys.getsizeof(G), 'asizeof', asizeof.asizeof(G))
    print('G created in %.2f' % (time.time() - t0))

    for i in var:
        params.append((i, G))
    res = list(p.imap(f, params))
    p.close()
    p.join()

if __name__ == '__main__':
    main()

The output is as follows:

Size of G by sys 56 asizeof 338079824
G created in 17.36
factorial of 4: start@Fri May 31 11:39:26 2019
factorial of 8: start@Fri May 31 11:39:53 2019
factorial of 12: start@Fri May 31 11:40:19 2019
factorial of 20: start@Fri May 31 11:40:44 2019
factorial of 16: start@Fri May 31 11:41:10 2019

Process finished with exit code 0
  • Different operating systems handle large arguments for child processes differently. Which OS are you using? – Klaus D. May 31 '19 at 01:14
  • I am using Ubuntu. – liang li May 31 '19 at 18:40
  • If you change N from 1000 to, say, 50, then the delay disappears in the "pathos" code as well. I assume "pathos" cannot process 338 MB at system (C/C++) speed, but instead goes the Pythonic way through the interpreter. – Alex Lopatin May 31 '19 at 23:41

3 Answers


This fat argument (338 MB) has to be copied into separate memory each time a process is created, but that should not take this long (24 seconds).

Here is how it works on my computer; I changed the code as follows:

import multiprocessing
import os
import time
import sys
from pympler import asizeof
import networkx as nx
import random

def factorial(args):
    (x, t, graph) = args
    s0 = '# pid %s x %2d' % (os.getpid(), x)
    s1 = 'started @ %.2f' % (time.time() - t)
    print(s0, s1)
    f = 1
    while x > 1:
        f *= x
        x -= 1
        time.sleep(0.5)
    s2 = 'ended   @ %.2f' % (time.time() - t)
    print(s0, s2, f)
    return s0, s1, s2, f

if __name__ == '__main__':
    t0 = time.time()
    N = 1000
    G = nx.complete_graph(N, nx.DiGraph())
    for (start, end) in G.edges:
        G.edges[start, end]['weight'] = random.random()
    print('Size of G by sys', sys.getsizeof(G), 'asizeof', asizeof.asizeof(G))
    print('G created in %.2f' % (time.time() - t0))
    t0 = time.time()
    p = multiprocessing.Pool(processes=4)
    outputs = list(p.imap(factorial, [(i, t0, G) for i in (4, 8, 12, 20, 16)]))
    print('output:')
    for output in outputs:
        print(output)

Output now:

Size of G by sys 56 asizeof 338079824
G created in 13.03
# pid 2266 x  4 started @ 1.27
# pid 2267 x  8 started @ 1.98
# pid 2268 x 12 started @ 2.72
# pid 2266 x  4 ended   @ 2.77 24
# pid 2269 x 20 started @ 3.44
# pid 2266 x 16 started @ 4.09
# pid 2267 x  8 ended   @ 5.49 40320
# pid 2268 x 12 ended   @ 8.23 479001600
# pid 2266 x 16 ended   @ 11.60 20922789888000
# pid 2269 x 20 ended   @ 12.95 2432902008176640000
output:
('# pid 2266 x  4', 'started @ 1.27', 'ended   @ 2.77', 24)
('# pid 2267 x  8', 'started @ 1.98', 'ended   @ 5.49', 40320)
('# pid 2268 x 12', 'started @ 2.72', 'ended   @ 8.23', 479001600)
('# pid 2269 x 20', 'started @ 3.44', 'ended   @ 12.95', 2432902008176640000)
('# pid 2266 x 16', 'started @ 4.09', 'ended   @ 11.60', 20922789888000)

The 338 MB of data was created in about 13 seconds, and, yes, it does take time to start the first 4 processes. The delays between starts, however, are much smaller: 0.71, 0.74, and 0.72 seconds. I have an iMac with an Intel i5 @ 3.2 GHz.

The biggest N at which there is no visible delay is 78:

Size of G by sys 56 asizeof 1970464
G created in 0.08
# pid 2242 x  4 started @ 0.01
# pid 2243 x  8 started @ 0.01
# pid 2244 x 12 started @ 0.01
# pid 2245 x 20 started @ 0.01
# pid 2242 x  4 ended   @ 1.51 24
# pid 2242 x 16 started @ 1.53
# pid 2243 x  8 ended   @ 3.52 40320
# pid 2244 x 12 ended   @ 5.52 479001600
# pid 2242 x 16 ended   @ 9.04 20922789888000
# pid 2245 x 20 ended   @ 9.53 2432902008176640000
output:
('# pid 2242 x  4', 'started @ 0.01', 'ended   @ 1.51', 24)
('# pid 2243 x  8', 'started @ 0.01', 'ended   @ 3.52', 40320)
('# pid 2244 x 12', 'started @ 0.01', 'ended   @ 5.52', 479001600)
('# pid 2245 x 20', 'started @ 0.01', 'ended   @ 9.53', 2432902008176640000)
('# pid 2242 x 16', 'started @ 1.53', 'ended   @ 9.04', 20922789888000)
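
As a further note, the per-task copy can be avoided altogether by handing the graph to each worker only once, e.g. with a Pool initializer that stores it in a module-level global. A minimal sketch, assuming the default fork start method on Linux/macOS (under fork the graph is inherited by the workers; under spawn it is pickled once per worker, but never once per task):

import multiprocessing
import networkx as nx

G = None  # set once in every worker by the initializer

def init_worker(graph):
    # runs once per worker process, so the graph is not re-sent per task
    global G
    G = graph

def factorial(x):
    f = 1
    while x > 1:
        f *= x
        x -= 1
    return f  # G is available here without being pickled for every task

if __name__ == '__main__':
    graph = nx.complete_graph(1000, nx.DiGraph())
    with multiprocessing.Pool(processes=4, initializer=init_worker,
                              initargs=(graph,)) as p:
        print(p.map(factorial, (4, 8, 12, 20, 16)))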
  • Hi Alexander, thanks for your generous answer! I removed the pipe in both the "multiprocessing" and "pathos" code. The "multiprocessing" code performs like yours, but the "pathos" code still has a 24-second delay. – liang li May 31 '19 at 18:30
  • Hi Alexander, I posted my revised pathos code in the UPDATE section of my question. Could you possibly help check it? Thank you! – liang li May 31 '19 at 18:47
  • I will add another answer since it is another question ;-) – Alex Lopatin Jun 01 '19 at 01:44

I changed N to 50 and ran the "pathos" code with a debugger in PyCharm, stopping after 'G created in 7.79'. The output below confirmed my suspicion about why "pathos" is slower: it uses connection and socket objects (depending on the platform) to pass arguments and start subprocesses. This is why it is so much slower, by about 30 times. On the bright side: it works over the network.

Debug output:

/usr/local/bin/python3.7 "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py" --multiproc --qt-support=auto --client 127.0.0.1 --port 51876 --file /Users/alex/PycharmProjects/game/object_type.py
pydev debugger: process 1526 is connecting

Connected to pydev debugger (build 191.6605.12)
Size of G by sys 56 asizeof 57126904
G created in 7.79
Process ForkPoolWorker-3:
Process ForkPoolWorker-2:
Process ForkPoolWorker-1:
Process ForkPoolWorker-4:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/process.py", line 297, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/process.py", line 297, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/pool.py", line 110, in worker
    task = get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/queues.py", line 354, in get
    with self._rlock:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/pool.py", line 110, in worker
    task = get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/synchronize.py", line 102, in __enter__
    return self._semlock.__enter__()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/queues.py", line 354, in get
    with self._rlock:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/synchronize.py", line 102, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
KeyboardInterrupt
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/process.py", line 297, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/pool.py", line 110, in worker
    task = get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/queues.py", line 355, in get
    res = self._reader.recv_bytes()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/connection.py", line 219, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/connection.py", line 410, in _recv_bytes
    buf = self._recv(4)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/connection.py", line 382, in _recv
    chunk = read(handle, remaining)
Traceback (most recent call last):
KeyboardInterrupt
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/process.py", line 297, in _bootstrap
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/pool.py", line 110, in worker
    task = get()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/queues.py", line 354, in get
    with self._rlock:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/synchronize.py", line 102, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/pool.py", line 733, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1741, in <module>
    main()
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm CE.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/alex/PycharmProjects/game/object_type.py", line 100, in <module>
    outputs = list(p.imap(factorial, [(i, t0, G) for i in (4, 8, 12, 20, 16)]))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/multiprocess/pool.py", line 737, in next
    self._cond.wait(timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt
  • Hi Alexander, appreciated! I have to use pathos because I need to parallelize a class method (a bound method), and "multiprocessing" cannot handle bound methods in Python. If I understand correctly, the delay is introduced by sending the fat argument to the subprocess. How could I overcome this pathos issue? In particular, copying a fat object to each process is tedious; could shared memory avoid copying the object argument to each process, or could any other technique/module help? Thank you, Alexander! – liang li Jun 01 '19 at 14:42
  • Hi Liang Li, shared memory is available in Python 3.8, and it has multiple limitations, a 10 MB size being one of them. I would recommend investigating three paths to solve your problem: (1) threading, so the object can be shared (the current threading is not a real thing in Python, but 'import concurrent.futures' promises some solutions); (2) rewriting the code so you can call static functions instead of a class method; (3) C++. – Alex Lopatin Jun 04 '19 at 02:44
  • Hi Alex, thanks for your generous reply! Is shared memory part of the "multiprocessing" module? Where can I find documentation about the limits of shared memory in Python? Thank you! – liang li Jun 05 '19 at 00:12
  • "This constrains storable values to only the int, float, bool, str (less than 10M bytes each), bytes (less than 10M bytes each), and None built-in data types." https://docs.python.org/dev/library/multiprocessing.shared_memory.html – Alex Lopatin Jun 05 '19 at 00:53
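
A minimal sketch of the threading option (1) from the comment above, using concurrent.futures; the Analyzer class and its method are illustrative only, and note that pure-Python CPU-bound work will not actually run in parallel because of the GIL:

import concurrent.futures
import networkx as nx

class Analyzer:
    def __init__(self, graph):
        self.graph = graph  # shared by all threads, never copied or pickled

    def degree_of(self, node):
        # illustrative bound method; replace with the real computation
        return self.graph.degree(node)

analyzer = Analyzer(nx.complete_graph(100, nx.DiGraph()))

# threads share the process memory, so the fat graph is never serialized,
# and bound methods can be submitted directly
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as ex:
    print(list(ex.map(analyzer.degree_of, (4, 8, 12, 20, 16))))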

On a related note: I ran into this problem while trying to pass a pandas DataFrame as an argument to a function, with joblib managing the parallel processing.

Joblib pickles the arguments to pass information to each worker. Pickling a DataFrame of even modest size (<1 MB) can be time-consuming. In my case, pickling was so bad that joblib with 10-20 workers was slower than a simple loop. However, joblib handles lists, dicts, and np.arrays much more efficiently. So a simple hack I found was to pass a list containing the DataFrame's content as an np.array, together with its columns, and to recombine them inside the function, as sketched below.

Passing param=[df.values, df.columns] to joblib was 50x faster than simply passing param=df.
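
A minimal sketch of this hack (the work function and the column names are illustrative, assuming joblib, numpy, and pandas are installed):

import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def work(values, columns):
    # recombine the cheap-to-serialize pieces back into a DataFrame
    df = pd.DataFrame(values, columns=columns)
    return df['a'].sum()  # stand-in for the real computation

df = pd.DataFrame({'a': np.arange(100000), 'b': np.random.rand(100000)})

# pass the values array and the columns instead of the DataFrame itself
results = Parallel(n_jobs=4)(
    delayed(work)(df.values, df.columns) for _ in range(8)
)
print(results)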