If I import numpy in a single process, it takes approximately 0.0749 seconds:

python -c "import time; s=time.time(); import numpy; print(time.time() - s)"
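
To check that this baseline isn't just a one-off cold-cache number, something like the following (a minimal sketch, standard library only) can repeat the measurement sequentially and confirm it is stable:

import subprocess

# Repeat the single-process measurement a few times in a row; the first run
# may pay cold-cache costs, the later runs should settle around the same value.
cmd = 'python -c "import time; s=time.time(); import numpy; print(time.time() - s)"'
for _ in range(5):
    subprocess.run(cmd, shell=True)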

Now if I run the same code in multiple processes simultaneously, each import becomes significantly slower:

import subprocess
cmd = 'python -c "import time; s=time.time(); import numpy; print(time.time() - s)"'

for n in range(5):
    m = 2**n
    print(f"Importing numpy on {m} Process(es):")
    processes = []
    for i in range(m):
        processes.append(subprocess.Popen(cmd, shell=True))
    for p in processes:
        p.wait()
    print()

gives the output:

Importing numpy on 1 Process(es):
0.07726049423217773

Importing numpy on 2 Process(es):
0.110260009765625
0.11645245552062988

Importing numpy on 4 Process(es):
0.13133740425109863
0.1264667510986328
0.13683867454528809
0.153900146484375

Importing numpy on 8 Process(es):
0.13650751113891602
0.15682148933410645
0.17088770866394043
0.1705784797668457
0.1690073013305664
0.18076491355895996
0.18901371955871582
0.18936467170715332

Importing numpy on 16 Process(es):
0.24082279205322266
0.24885773658752441
0.25356197357177734
0.27071142196655273
0.29327893257141113
0.2999141216278076
0.297823429107666
0.31664466857910156
0.20108580589294434
0.33217334747314453
0.24672770500183105
0.34597229957580566
0.24964046478271484
0.3546409606933594
0.26511287689208984
0.2684178352355957

The import time per process seems to grow almost linearly with the number of processes (especially as the number of processes grows large), so the total time spent importing looks roughly O(n^2). I know there is an import lock, but I'm not sure why it is there. Are there any workarounds? And if I work on a server with many users running many tasks, could I be slowed down by someone spawning tons of workers that just import common packages?
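
One workaround I'm considering (a minimal sketch, assuming Linux, where multiprocessing defaults to the fork start method): import the heavy package once in the parent before creating the workers, so forked children inherit the already-loaded module and the import inside each worker is just a sys.modules lookup. With the spawn start method each child still has to import from scratch, so this wouldn't help there.

import multiprocessing
import time

import numpy  # imported once in the parent, before any workers are forked


def f(x):
    s = time.time()
    import numpy as np  # with fork, numpy is already in sys.modules, so this is ~free
    return time.time() - s


if __name__ == "__main__":
    # "fork" is the default start method on Linux; spelled out here for clarity.
    ctx = multiprocessing.get_context("fork")
    for n in range(6):
        m = 2 ** n
        with ctx.Pool(m) as p:
            print(f"importing with {m} worker(s): {sum(p.map(f, range(m))) / m}")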

The pattern is clearer for larger n. Here's a script that shows this by reporting only the average import time across n workers:

import multiprocessing
import time

def f(x):
    s = time.time()
    import numpy as np
    return time.time() - s

ps = []
for n in range(10):
    m = 2**n
    with multiprocessing.Pool(m) as p:
        print(f"importing with {m} worker(s): {sum(p.map(f, range(m)))/m}")

output:

importing with 1 worker(s): 0.06654548645019531
importing with 2 worker(s): 0.11186492443084717
importing with 4 worker(s): 0.11750376224517822
importing with 8 worker(s): 0.14901494979858398
importing with 16 worker(s): 0.20824094116687775
importing with 32 worker(s): 0.32718323171138763
importing with 64 worker(s): 0.5660803504288197
importing with 128 worker(s): 1.034045523032546
importing with 256 worker(s): 1.8989756992086768
importing with 512 worker(s): 3.558808562345803

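For reference, the cost above is paid once per process: within a single process, only the first import does the filesystem search and runs the module's init code, while a repeated import is just a dictionary lookup in sys.modules. That's also why caching at the Python level can't help across separate processes. A quick single-process check (a sketch, nothing specific to my setup):

import sys
import time

s = time.time()
import numpy          # first import: searches sys.path, reads and executes the package
first = time.time() - s

s = time.time()
import numpy          # second import: just a lookup in sys.modules
second = time.time() - s

print(f"first import:  {first:.4f} s")
print(f"second import: {second:.6f} s")
print("cached in sys.modules:", "numpy" in sys.modules)
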
Extra details about the environment in which I ran this:

  • python version: 3.8.6
  • pip list:
Package    Version
---------- -------
numpy      1.20.1
pip        21.0.1
setuptools 53.0.0
wheel      0.36.2

OS:

  • NAME="Pop!_OS"
  • VERSION="20.10"

Is it just reading from the filesystem that is the problem?

I've added a simple test where, instead of importing, I just read the numpy files and do some sanity-check calculations:

import subprocess

cmd = 'python read_numpy.py'

for n in range(5):
    m = 2**n
    print(f"Running on {m} Process(es):")
    processes = []
    for i in range(m):
        processes.append(subprocess.Popen(cmd, shell=True))
    for p in processes:
        p.wait()
    print()

with read_numpy.py:

import os
import time

file_path = "/home/.virtualenvs/multiprocessing-import/lib/python3.8/site-packages/numpy"
t1 = time.time()
parity = 0
for root, dirs, filenames in os.walk(file_path):
    for name in filenames:
        contents = open(os.path.join(root, name), "rb").read()
        parity = (parity + sum([x%2 for x in contents]))%2

print(parity, time.time() - t1)

Running this gives me the following output:

Running on 1 Process(es):
1 0.8050086498260498

Running on 2 Process(es):
1 0.8164374828338623
1 0.8973987102508545

Running on 4 Process(es):
1 0.8233649730682373
1 0.81931471824646
1 0.8731539249420166
1 0.8883578777313232

Running on 8 Process(es):
1 0.9382946491241455
1 0.9511561393737793
1 0.9752676486968994
1 1.0584545135498047
1 1.1573944091796875
1 1.163221836090088
1 1.1602907180786133
1 1.219961166381836

Running on 16 Process(es):
1 1.337137222290039
1 1.3456192016601562
1 1.3102262020111084
1 1.527071475982666
1 1.5436983108520508
1 1.651414394378662
1 1.656200647354126
1 1.6047494411468506
1 1.6851506233215332
1 1.6949374675750732
1 1.744239330291748
1 1.798882246017456
1 1.8150532245635986
1 1.8266475200653076
1 1.769331455230713
1 1.8609044551849365

There is some slowdown: 0.805 seconds for 1 worker, and between 0.819 and 0.888 seconds for 4 workers. Compare that to import: 0.07 seconds for 1 worker, and between 0.126 and 0.153 seconds for 4 workers. It seems like something other than filesystem reads is slowing down the import.
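
To break down where the time actually goes (filesystem access vs. executing module-level code), CPython's `-X importtime` flag (Python 3.7+) prints a per-module breakdown of self and cumulative import time to stderr. Running it once alone and then with several concurrent copies should show which modules get slower under contention. A minimal sketch:

import subprocess

cmd = 'python -X importtime -c "import numpy"'


def run(n):
    # Start n copies concurrently and capture the importtime breakdown
    # that each interpreter writes to stderr.
    procs = [subprocess.Popen(cmd, shell=True, stderr=subprocess.PIPE) for _ in range(n)]
    for i, p in enumerate(procs):
        _, err = p.communicate()
        lines = err.decode().splitlines()
        print(f"--- process {i} of {n}, tail of import breakdown ---")
        print("\n".join(lines[-5:]))  # the last lines include the numpy package totals


run(1)   # baseline, no contention
run(8)   # under contention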

  • each `import` requires searching the filesystem for the module to load (at the locations specified by `sys.path`). That means disk access, which in most cases means hitting the SATA bus, which hardly benefits from parallel access at all (especially if it's spinning rust). – Aaron Mar 17 '21 at 18:09
  • BTW, you are actually creating twice the # of python processes, because you `fork` a child process just to call subprocess, which `spawn`s another child. Technically the `fork` is much less expensive than the `spawn`, but just for the record. Also, it rarely makes sense to create more processes than you have logical processing cores on your cpu. Threads are much less expensive from a memory and cpu overhead perspective. – Aaron Mar 17 '21 at 18:18
  • @Aaron why would every import have to search/read from the disk? I would have expected everything to get cached after the first read and be faster for following imports. – William Patton Mar 19 '21 at 13:03
  • @Aaron, good point. I edited the script to avoid forking as well as spawning a new process. It doesn't seem to have made a difference. – William Patton Mar 19 '21 at 13:15
  • `import`s are only cached within a process. separate processes don't share memory (that's one of the core concepts of what separates processes and threads). It's like opening totally separate instances of python. (remember, I'm specifically talking about using spawn. "fork" copies the memory space to the new process so no additional work is needed) – Aaron Mar 19 '21 at 18:18
  • I was under the impression that filesystem reads are cached between processes: https://stackoverflow.com/questions/28828517/file-caching-between-processes So if it was just read speeds from disk then I don't see why this would make importing slow. – William Patton Mar 22 '21 at 13:15
  • from the answer to that question: "You can't make any assumptions on how much of the file, if any, is still in cache at any given time". It also looks like you **do** benefit from multiple processes up to a point, but I'd be willing to bet you don't have 512 cpu cores, so you obviously won't get scaling beyond the # of cores you have. There is also a non-trivial amount of cpu work to do on `import` – Aaron Mar 22 '21 at 13:20
  • well, just going from 1 process to 4 already has a pretty significant slowdown. The compute I would expect to be constant because I do have more than 4 CPU cores. I would also expect only 1 read from the filesystem since they all need to read the same files, so that seems like a perfect application for caching. I don't really know how caching works so I could be wrong there. Are you saying that I would see the same slowdown if instead of `import`, I simply read all of the relevant files and computed some hash of the contents? That would be the same 2 steps of "read" + "some compute" – William Patton Mar 22 '21 at 13:42
