36

Our Django project is getting huge. We have hundreds of apps and use a ton of third-party Python packages, many of which have C extensions that need compiling. Our deployments take a long time whenever we need to create a new virtual environment for a major release. With that said, I'm looking to speed things up, starting with Pip. Does anyone know of a fork of Pip that will install packages in parallel?

Steps I've taken so far:

  • I've looked for a project that does just this with little success. I did find this Github Gist: https://gist.github.com/1971720 but the results are almost exactly the same as our single threaded friend.

  • I then found the Pip project on Github and started looking through the network of forks to see if I could find any commits that mentioned doing what I'm trying to do. It's a mess in there. I will fork it and try to parallelize it myself if I have to, I just want to avoid spending time doing that.

  • I saw a talk at DjangoCon 2011 from ep.io explaining their deployment stuff and they mention parallelizing pip, shipping .so files instead of compiling C and mirroring Pypi, but they didn't touch on how they did it or what they used.

Kyle
  • 1
    Using virtual machines as your unit of deployment and making everything into OS (Debian) packages is what we do. You can then run your own repository and do smooth incremental upgrades and complete installs. Having pre-built OS packages is a great way of making sure you have a repeatable install, and you can make them depend on non-Python stuff like apache or nginx. – Nick Craig-Wood Jun 13 '12 at 18:39
  • @NickCraig-Wood While that is a great idea, we are understaffed and don't have time to convert all the python packages at the versions we use to .debs. We already run everything on top of KVM. We just need deployments to be quicker as soon as possible. – Kyle Jun 13 '12 at 19:04
  • 1
    This is an old question, but nowadays you can build a pip wheelhouse cache, which cuts down the package installation time considerably. – Mikko Ohtamaa Jan 25 '16 at 19:56

7 Answers

18

Parallel pip installation

This example uses xargs to parallelize the build process by approximately 4x. You can increase the parallelization factor with --max-procs below (keep it approximately equal to your number of cores).

If you're trying to, e.g., speed up an imaging process that you run over and over, it may be easier (and will definitely consume less bandwidth) to build an image directly from the installed result and reuse that, rather than re-installing each time; you can bake the dependencies in with pip -t or a virtualenv.
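For illustration, a sketch of the pip -t variant mentioned above (the target directory name here is arbitrary):

pip install -t ./build/deps -r requires.txt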

Download and install packages in parallel, four at a time:

xargs --max-args=1 --max-procs=4 sudo pip install < requires.txt

Note: xargs parameter names differ between implementations (GNU, BSD, busybox, etc.). Check your system's man page for specifics.
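For reference, the short option forms below are equivalent and are understood by both GNU and BSD xargs:

xargs -n 1 -P 4 sudo pip install < requires.txt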

Same thing inlined using a here-doc:

cat << EOF | xargs --max-args=1 --max-procs=4 sudo pip install
awscli
bottle
paste
boto
wheel
twine
markdown
python-slugify
python-bcrypt
arrow
redis
psutil
requests
requests-aws
EOF

Warning: there is a remote possibility that this method might confuse package manifests (depending on your distribution) if multiple pip processes try to install the same dependency at exactly the same time, but it's very unlikely if you're only doing 4 at a time. It could be fixed pretty easily with pip uninstall depname followed by pip install depname.
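If that risk concerns you and requires.txt is already fully flattened (every transitive dependency pinned), a variation not in the original answer is to add --no-deps, so each process installs exactly one package without resolving overlapping dependency trees:

xargs --max-args=1 --max-procs=4 sudo pip install --no-deps < requires.txt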

fatal_error
  • 1
    Cool hack but I'd hate to have to debug a dependency issue in it :) How did you come by that order? – Itamar Haber Apr 14 '15 at 10:50
  • Thanks @ItamarHaber.. and I agree - that wouldn't be fun :) it's a snippet of the packages file I normally use (which was alphabetized at one point). Spooky is especially cool for working with Redis (as is shortuuid which appears to not be in this list). – fatal_error Apr 21 '15 at 20:06
  • 2
    If you maintained a flattened requirements list, i.e. including all sub-dependencies, then you could add the `--no-deps` option to (possibly, presumably) avoid some of the issues mentioned. Each process would install a single package, independent of the other processes. – Peter Hansen Jul 20 '18 at 19:56
  • 1
    When installing with pip, dependencies are being checked between one another, the "complete" view of all deps is important for pip to install a complete and correct view. the above solution makes pip blind to the different dependencies, so I don't think this can really work in a reliable way. – JAR.JAR.beans Apr 30 '19 at 12:18
  • @JAR.JAR.beans is right - unless `requires.txt` is a "lock file," installing packages this way is not a good idea. – ron rothman May 10 '19 at 01:06
  • With `pip`, there is no centralized database of installed packages, so no need to lock. (To @JAR.JAR.beans' point, there is, however, a minute risk that two dependencies might try to install exactly the same file -- not dependency -- at exactly the same time, but it's quite unlikely, and would be easy to correct if it occurred.) Also see the last paragraph of the answer. – fatal_error May 14 '19 at 16:06
  • Also check out @johntellsall's derivative of my answer above for a slower but safer alternative if this concerns you, or if you are automating this and won't be able to watch any errors as they arise. – fatal_error May 14 '19 at 16:12
  • @JamiesonBecker I don't think we are talking about the same thing. a "lock file" is a common practice in package managers (already added to python as part of poetry and Pipfile) which allows making sure that a certain set of dependencies is installed consistently across environments and servers. This is nothing about race conditions. – JAR.JAR.beans May 15 '19 at 05:33
  • @JAR.JAR.beans well, of course, `pip` doesn't support lock files in the way that pipfile/pipenv does. builds will be non-deterministic when using `pip`, which of course one of the big reasons why package management has evolved so much (not just in python, but in other communities as well like `npm`/`yarn`, `go` modules, etc), as you alluded. so, the obvious answer to your original question is that `requires.txt` is not a lock file and was never intended to be one; if someone wants deterministic builds, they should choose a package manager other than `pip`. See also Peter Hansen's comment above – fatal_error May 17 '19 at 19:18
  • Isn't there a need for a topological sort, to keep certain things from being installed too soon? – dstromberg Dec 10 '19 at 00:18
  • @dstromberg no, since there are no common files or databases that need to be merged or to prevent corruption. The filesystem itself contains the list of packages. The only real risk is that the same exact packages are being installed at the same time -- actually, that the same exact files in the same packages are being written at precisely the same instant. That's pretty unlikely, but obviously don't automate this or do it in production, since there are still some (small) risks. – fatal_error Dec 12 '19 at 05:31
  • @jamiesonBecker – dstromberg Dec 14 '19 at 20:41
13

Building on Fatal's answer, the following code does parallel Pip download, then quickly installs the packages.

First, we download packages in parallel into a distribution ("dist") directory. This runs easily in parallel with no conflicts. Each package name is printed out before download, which helps with debugging. For extra help, change the -P9 to -P1 to download sequentially.

After download, the next command tells Pip to install/update packages. Files are not downloaded, they're fetched from the fast local directory.

It's compatible with the current version of Pip (1.7), and also with Pip 1.5.

To install only a subset of packages, replace the 'cat requirements.txt' statement with your own filter command, e.g. 'egrep -v github requirements.txt', as shown after the commands below.

cat requirements.txt | xargs -t -n1 -P9 pip install -q --download ./dist

pip install --no-index --find-links=./dist -r ./requirements.txt
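For example, to skip any requirement that references GitHub (purely an illustration of swapping in a custom filter; adjust the pattern to your needs):

egrep -v github requirements.txt | xargs -t -n1 -P9 pip install -q --download ./dist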
johntellsall
  • 3
    I like it! This will be slower (it doesn't parallelize the actual installation in case you are compiling C extensions), but this nicely prevents any concurrent-write issues. Cool idea. – fatal_error Aug 30 '15 at 16:52
  • 7
    Be careful as unfortunately, this can easily mess up the versions of dependencies. But if all library versions are constrained (with the -c switch) it should work fine. – allprog Sep 12 '17 at 07:45
  • Useless use of cat: you can use ` < requirements.txt`, no need for cat. – dalore Jul 18 '19 at 11:37
  • 1
    pip install --download is deprecated. Starting from version 8.0.0 you should use pip download command `pip download ` – bumblebee Jan 23 '20 at 11:14
10

Have you analyzed the deployment process to see where the time really goes? It surprises me that running multiple parallel pip processes does not speed it up much.

If the time goes to querying PyPI and finding the packages (in particular when you also download from Github and other sources) then it may be beneficial to set up your own PyPI. You can host PyPI yourself and add the following to your requirements.txt file (docs):

--extra-index-url YOUR_URL_HERE

or the following if you wish to replace the official PyPI altogether:

--index-url YOUR_URL_HERE

This may speed up download times as all packages are now found on a nearby machine.
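As an alternative to editing requirements.txt, the index URL can be configured globally in pip's config file. A minimal sketch, assuming a hypothetical internal mirror URL:

mkdir -p ~/.pip
cat > ~/.pip/pip.conf << 'EOF'
[global]
index-url = http://pypi.internal.example.com/simple/
EOF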

A lot of time also goes into compiling packages with C code, such as PIL. If this turns out to be the bottleneck, then it's worth looking into compiling code in multiple processes. You may even be able to share compiled binaries between your machines (but many things would need to be similar, such as operating system, CPU word length, et cetera).
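As one hedged sketch (requiring a newer pip than was current when this was asked; wheel support arrived in pip 1.4), you could compile once on a build machine and share the resulting binary wheels with every similar machine:

# on one build machine (same OS and architecture as the targets)
pip wheel -r requirements.txt -w ./wheelhouse

# on each target machine, install from the prebuilt wheels only
pip install --no-index --find-links=./wheelhouse -r requirements.txt

Whether the C compilation itself runs in parallel depends on each package's build system, but this way you only pay the compile cost once.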

Simeon Visser
  • My first step was mirroring Pypi using z3c.pypimirror and that helped. I think the next step is either using binaries for things that need to be compiled, or parallelizing. I tried my best to make sense of what is happening in the Gist code. I believe it's scheduling the subprocess and it's being run at a later time. I'm not sure how to make sure all the subprocesses run at the same time using gevent. – Kyle Jun 13 '12 at 19:01
  • If you want to speed up compilations I've had a lot of luck with ccache in the past. – Nick Craig-Wood Jun 14 '12 at 07:02
3

Will it help if you have your build system (e.g. Jenkins) build and install everything into a build-specific virtual environment directory? When the build succeeds, you make the virtual environment relocatable, tarball it and push the resulting tarball to your "released-tarballs" storage. At deploy time, you grab the latest tarball and unpack it on the destination host, and then it should be ready to execute. So if it takes 2 seconds to download the tarball and 0.5 seconds to unpack it on the destination host, your deployment will take 2.5 seconds.

The advantage of this approach is that all package installations happen at build time, not at deploy time.

Caveat: the build system worker that builds/compiles/installs things into a virtual env must use the same architecture as the target hardware. Also, your production box provisioning system will need to take care of the various C library dependencies that some Python packages have (e.g. PIL requires that libjpeg be installed before it can compile JPEG-related code, and things will also break if libjpeg is not installed on the target box).

It works well for us.

Making a virtual env relocatable:

virtualenv --relocatable /build/output/dir/build-1123423

In this example build-1123423 is a build-specific virtual env directory.
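A rough sketch of the whole flow, with placeholder paths (the build directory, tarball name and deploy location are just examples):

# on the build host (same OS/architecture as production)
virtualenv /build/output/dir/build-1123423
/build/output/dir/build-1123423/bin/pip install -r requirements.txt
virtualenv --relocatable /build/output/dir/build-1123423
tar -C /build/output/dir -czf build-1123423.tar.gz build-1123423

# on the deploy host
tar -C /srv/app -xzf build-1123423.tar.gz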

Pavel Repin
1

I came across a similar issue and I ended up with the below:

cat requirements.txt | sed -e '/^\s*#.*$/d' -e '/^\s*$/d' | xargs -n 1 python -m pip install

That will read requirements.txt line by line and execute pip for each entry. I cannot properly recall where I got the answer from, so apologies for that, but I found some justification below:

  1. How sed works: https://howto.lintel.in/truncate-empty-lines-using-sed/
  2. Another similar answer but with git: https://stackoverflow.com/a/46494462/7127519
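Since the question is about speeding things up, the same pipeline can be fanned out by adding -P to xargs (a sketch; pick a value that suits your core count, and mind the caveats about concurrent installs discussed in the other answers):

cat requirements.txt | sed -e '/^\s*#.*$/d' -e '/^\s*$/d' | xargs -n 1 -P 4 python -m pip install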

Hope this helps with alternatives. I posted this solution here: https://stackoverflow.com/a/63534476/7127519, so maybe there is some help there.

Rafael Valero
  • 1
    The `sed` command is very useful with pip-compile output. I found I needed to remove the `^` to make it work. Like this: `sed -e '/^\s*#.*$/d' -e '/^\s*$/d'` – River Jul 06 '22 at 20:47
0

The answer at hand is to use, for example, Poetry if you can, which downloads and installs in parallel by default. But the question is about pip, so:

If you need to install dependencies from a requirements.txt whose entries have hash parameters and Python specifiers (or just hashes), you cannot use a normal pip install of individual packages, as it does not support them. Your only choice is to use pip install -r.

So the question is how to do a parallel install from a requirements file where each dependency has a hash and a Python specifier defined. Here is how the requirements file looks:

swagger-ui-bundle==0.0.9; python_version >= "3.8" and python_version < "4.0" \
    --hash=sha256:cea116ed81147c345001027325c1ddc9ca78c1ee7319935c3c75d3669279d575 \
    --hash=sha256:b462aa1460261796ab78fd4663961a7f6f347ce01760f1303bbbdf630f11f516
typing-extensions==4.0.1; python_version >= "3.8" and python_version < "4.0" \
    --hash=sha256:7f001e5ac290a0c0401508864c7ec868be4e701886d5b573a9528ed3973d9d3b \
    --hash=sha256:4ca091dea149f945ec56afb48dae714f21e8692ef22a395223bcd328961b6a0e
unicon.plugins==21.12; python_version >= "3.8" and python_version < "4.0" \
    --hash=sha256:07f21f36155ee0ae9040d810065f27b43526185df80d3cc4e3ede597da0a1c72

This is what I came up with:

# create temp directory where we store split requirements
mkdir -p pip_install
# join lines that are separated with `\` and split each line into a separate
# requirements file (one dependency == one file),
# and save files in previously created temp directory
sed ':x; /\\$/ { N; s/\\\n//; tx }' requirements.txt | split -l 1 - pip_install/x
# collect all file paths from temp directory and pipe them to xargs and pip
find pip_install -type f | xargs -t -L 1 -P$(nproc) /usr/bin/python3 -mpip install -r
# remove temp dir
rm -rf pip_install
scagbackbone
-5

Inspired by Jamieson Becker's answer, I modified an install script to do parallel pip installs and it seems like an improvement. My bash script now contains a snippet like this:

requirements=''\
'numpy '\
'scipy '\
'Pillow '\
'feedgenerator '\
'jinja2 '\
'docutils '\
'argparse '\
'pygments '\
'Typogrify '\
'Markdown '\
'jsonschema '\
'pyzmq '\
'terminado '\
'pandas '\
'spyder '\
'matplotlib '\
'statlab '\
'ipython[all]>=3 '\
'ipdb '\
'tornado>=4 '\
'simplepam '\
'sqlalchemy '\
'requests '\
'Flask '\
'autopep8 '\
'python-dateutil '\
'pylibmc '\
'newrelic '\
'markdown '\
'elasticsearch '\
"'"'docker-py==1.1.0'"'"' '\
"'"'pycurl==7.19.5'"'"' '\
"'"'futures==2.2.0'"'"' '\
"'"'pytz==2014.7'"'"' '

echo requirements=${requirements}
for i in ${requirements}; do ( pip install $i > /tmp/$i.out 2>&1 & ); done

I can at least look for problems manually.
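A minimal reworking of the same idea using a bash array and wait, so quoting stays simple and the script blocks until every background install has finished (a sketch; the package list is abbreviated):

requirements=(numpy scipy Pillow 'tornado>=4' 'docker-py==1.1.0')
for pkg in "${requirements[@]}"; do
    # replace characters that are awkward in file names so each install gets its own log
    log="/tmp/${pkg//[^A-Za-z0-9._-]/_}.out"
    pip install "$pkg" > "$log" 2>&1 &
done
wait   # block until all background installs are done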

John Schmitt
  • 2
    You should use a requirements file (e.g. `pip install -r requirements.txt`). Less importantly, Bash allows for newlines in strings, so you don't need to close the string and escape the newline – Leo Sep 08 '17 at 20:06