
Let's say you have a project with several levels of nested folders, and in various places, to make import calls cleaner, people have amended the PYTHONPATH for the whole project.

This means that instead of saying:

from folder1.folder2.folder3 import foo

they can now say:

from folder3 import foo

and add folder1/folder2 to the PYTHONPATH. The question here is: if you keep this up and end up with a large number of paths added to PYTHONPATH, does that cause an appreciable or significant performance hit?
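As a minimal sketch of the setup being described (the folder names are the hypothetical ones above), amending the module search path at runtime has the same effect as putting the directory on PYTHONPATH before starting the interpreter:

import os
import sys

# Runtime equivalent of PYTHONPATH=folder1/folder2: put the directory at the
# front of the module search path before importing.
sys.path.insert(0, os.path.join("folder1", "folder2"))

# With that entry in place, the shorter form resolves the same module:
from folder3 import foo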

To add some sense of scale: in terms of performance, I'm asking about milliseconds at a minimum (i.e. 100 ms? 500 ms?).

apanzerj

3 Answers


The performance trade-off between having a lot of different directories in your PYTHONPATH and having a deeply nested package structure will be seen in the system calls. So assume we have the following directory structures:

bash-3.2$ tree a
a
└── b
    └── c
        └── d
            └── __init__.py
bash-3.2$ tree e
e
├── __init__.py
├── __init__.pyc
└── f
    ├── __init__.py
    ├── __init__.pyc
    └── g
        ├── __init__.py
        ├── __init__.pyc
        └── h
            ├── __init__.py
            └── __init__.pyc

We can use these structures and the strace program to compare and contrast the system calls that we generate for the following commands:

strace python -c 'from e.f.g import h'
PYTHONPATH="./a/b/c:$PYTHONPATH" strace python -c 'import d'

Many PYTHONPATH Entries

So the trade-off here is really system calls at start-up time, versus system calls at import time. For each entry in PYTHONPATH, python first checks to see if the directory exists:

stat("./a/b/c", {st_mode=S_IFDIR|0776, st_size=4096, ...}) = 0
stat("./a/b/c", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0

If the directory exists (it does ... indicated by the 0 on the right), Python will search for a number of modules when the interpreter starts. For each module it checks:

stat("./a/b/c/site", 0x7ffd900baaf0)    = -1 ENOENT (No such file or directory)
open("./a/b/c/site.x86_64-linux-gnu.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("./a/b/c/site.so", O_RDONLY)       = -1 ENOENT (No such file or directory)
open("./a/b/c/sitemodule.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("./a/b/c/site.py", O_RDONLY)       = -1 ENOENT (No such file or directory)
open("./a/b/c/site.pyc", O_RDONLY)      = -1 ENOENT (No such file or directory)

Each of these fails, and it moves on to the next entry in the path, searching for the modules in order. My 3.5 interpreter looked up 25 modules this way, producing an incremental 152 system calls on start-up per new PYTHONPATH entry.
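To get a rough sense of what that costs in wall-clock time, here is a small benchmark sketch (the numbers will vary with your machine, filesystem, and Python version): it launches a bare interpreter with increasing numbers of empty directories on PYTHONPATH and reports the average start-up time.

import os
import subprocess
import sys
import tempfile
import time

def startup_ms(n_entries, repeats=20):
    # Put n_entries empty directories on PYTHONPATH and time bare start-ups.
    dirs = [tempfile.mkdtemp() for _ in range(n_entries)]
    env = dict(os.environ, PYTHONPATH=os.pathsep.join(dirs))
    start = time.perf_counter()
    for _ in range(repeats):
        subprocess.run([sys.executable, "-c", "pass"], env=env, check=True)
    return (time.perf_counter() - start) / repeats * 1000

for n in (0, 10, 100):
    print("%3d entries: %.1f ms per start-up" % (n, startup_ms(n)))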

Deep package structure

The deep package structure pays no penalty on interpreter start-up, but when we import from the deeply nested package structure we do see a difference. As a baseline, here is the simple import of d/__init__.py from the a/b/c directory in our PYTHONPATH:

stat("/home/matt/a/b/c/d", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
stat("/home/matt/a/b/c/d/__init__.py", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
stat("/home/matt/a/b/c/d/__init__", 0x7ffd900ba990) = -1 ENOENT (No such file or directory)
open("/home/matt/a/b/c/d/__init__.x86_64-linux-gnu.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/a/b/c/d/__init__.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/a/b/c/d/__init__module.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/home/matt/a/b/c/d/__init__.py", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
open("/home/matt/a/b/c/d/__init__.pyc", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0664, st_size=117, ...}) = 0
read(4, "\3\363\r\n\17\3105[c\0\0\0\0\0\0\0\0\1\0\0\0@\0\0\0s\4\0\0\0d\0"..., 4096) = 117
fstat(4, {st_mode=S_IFREG|0664, st_size=117, ...}) = 0
read(4, "", 4096)                       = 0
close(4)                                = 0
close(3)                                = 0

Basically what this is doing is looking for the d package or module. When it finds d/__init__.py it opens it, and then opens d/__init__.pyc and reads the contents into memory before closing both files.

With our deeply nested package structure we have to repeat this operation 3 additional times, which is good for 15 system calls per directory for a total of 45 more system calls. While this is less than half the number of calls added by the addition of a path to our PYTHONPATH, the read calls could potentially be more time-consuming than other system calls (or require more system calls) depending on the size of the __init__.py files.
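As a rough sketch of the same comparison in wall-clock terms (again, hypothetical numbers that depend on your machine and Python version), you can build a flat and a deeply nested package on the fly and time a cold import of each in a fresh interpreter, so sys.modules caching doesn't hide the work:

import os
import subprocess
import sys
import tempfile
import time

def make_pkg(root, parts):
    # Create nested packages with empty __init__.py files; return the dotted name.
    path = root
    for part in parts:
        path = os.path.join(path, part)
        os.makedirs(path, exist_ok=True)
        open(os.path.join(path, "__init__.py"), "w").close()
    return ".".join(parts)

def cold_import_ms(root, dotted, repeats=20):
    # Import the package in a fresh interpreter so nothing is cached.
    env = dict(os.environ, PYTHONPATH=root)
    start = time.perf_counter()
    for _ in range(repeats):
        subprocess.run([sys.executable, "-c", "import " + dotted], env=env, check=True)
    return (time.perf_counter() - start) / repeats * 1000

root = tempfile.mkdtemp()
print("flat:", round(cold_import_ms(root, make_pkg(root, ["d"])), 1), "ms")
print("deep:", round(cold_import_ms(root, make_pkg(root, ["e", "f", "g", "h"])), 1), "ms")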

TL;DR

Taking this all into consideration, these differences are almost certainly not material enough to outweigh the design benefits of your desired solution.

This is especially true if your processes are long-running (like a web-app) rather than being short-lived.

We can reduce the system calls by:

  1. Removing any extraneous PYTHONPATH entries
  2. Pre-compiling your .pyc files so they don't need to be written (see the sketch after this list)
  3. Keeping your package structure flat
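For point 2, a minimal sketch using the standard library's compileall module (the directory name here is hypothetical):

import compileall

# Byte-compile every .py file under the project tree ahead of time so the
# interpreter doesn't need to write .pyc files at import time.
compileall.compile_dir("folder1", quiet=1)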

We could improve performance more drastically by removing your .py files (which are opened for debugging purposes alongside your .pyc files), but that seems like a step too far to me.

Hope this is useful; it's probably a far deeper dive than necessary.

Matthew Story
    TL;DR Don't worry about it until it's an actual noticeable problem. (really great write-up though) – Soviut Jun 29 '18 at 06:50
  • Also: a module is only loaded once per process (the first time it's imported), then it's cached in `sys.modules` - but this is only true if it's always imported with the same qualified name. When you have the same module accessible as both package.sub.module and sub.module _and_ it's imported once with each path, the module is actually loaded and cached twice. This is inefficient, but it also plays havoc with any piece of code relying on identity - such as exception handlers - since the module object itself _and all the objects it exports_ are duplicated under different ids. – bruno desthuilliers Jun 29 '18 at 11:42

This is the most terrible idea ever, really.

First of course because it makes code harder to read and reason about: wait, 'folder3', where does this come from??? Also because if two packages define a submodule with the same name, which one you'll get when importing depends on the order of entries in your PYTHONPATH. And once you've rearranged your PYTHONPATH so you get "moduleX" from "packageX" and not from "packageY", someone adds a "moduleY" under "packageX" which shadows the "moduleY" from "packageY". And then you are screwed...

But that's only the less annoying part...

If you have one module using from folder1.folder2.folder3 import foo and another using from folder3 import foo, you end up with two distinct module objects (two instances of your module) in sys.modules - and all the objects defined in those modules are also duplicated (two instances, different ids), so you now have a program that starts behaving in the most erratic way whenever identity testing is involved. And since exception handling relies on identity, if foo is an exception, then depending on which instance of the module raised it and which one is trying to catch it, the test will either succeed or fail with no discernible pattern.
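To make that concrete, here is a minimal sketch of the failure mode; the layout and names are hypothetical (a "project" root containing folder1/folder2/folder3/foo.py, each folder with an __init__.py, and foo.py defining class FooError(Exception)):

import os
import sys

# Both the project root and folder1/folder2 are on the search path, so the
# same file is importable under two different qualified names.
sys.path.insert(0, "project")
sys.path.insert(0, os.path.join("project", "folder1", "folder2"))

from folder1.folder2.folder3 import foo as foo_long
from folder3 import foo as foo_short

print(foo_long is foo_short)                    # False: two module objects
print(foo_long.FooError is foo_short.FooError)  # False: two exception classes

try:
    raise foo_long.FooError("raised via the long name")
except foo_short.FooError:
    print("caught")        # never reached: the classes have different ids
except Exception:
    print("not caught by what looks like the same exception class")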

Good luck debugging this...

bruno desthuilliers
  • I get this, and believe me, it's my current pain (also, I'm just the messenger, I didn't make this choice). I'm just wondering if this practice also affects performance in terms of lookup. – apanzerj Jun 29 '18 at 15:33

It would be unlikely to have an impact on performance unless you were appending paths on slow drive locations, and even then the effect is likely to be negligible.

The problem you're most likely to have by appending too many locations to PYTHONPATH is module conflicts, where different locations contain a module with the same name but different versions.
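A minimal sketch of that kind of conflict, with two hypothetical directories each shipping their own config.py: whichever entry comes first on the path silently wins.

import sys

sys.path.insert(0, "vendor_a")   # vendor_a/config.py, version "A"
sys.path.insert(0, "vendor_b")   # vendor_b/config.py, version "B"

import config
print(config.__file__)           # resolves to vendor_b/config.py; version "A" is shadowed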

ScottMcC