
We have 100+ private packages, and so far we've been using s3pypi to set up a private PyPI in an S3 bucket. Our private packages have dependencies on each other (and on public packages), and it is (of course) important that our GitLab pipelines find the latest functional version of the packages they rely on. That is, we're not interested in the latest checked-in code: we create new wheels only after tests and QA have run against a push to master (which is a long-winded way of explaining that -e <vcs> requirements will not work).

Our setup works really well until someone creates a new public package on the official PyPI that shadows one of our package names. We can force our private package to be chosen by increasing the version number so it is higher than the new package on pypi.org, or by renaming our package to something that hasn't yet been taken on pypi.org.

This is obviously a hacky and fragile solution, but apparently the functionality works this way by design.

After the initial bucket setup, s3pypi has required no maintenance or administration. The above ticket suggests using devpi, but that seems like a very heavy solution requiring administration/monitoring/etc.

GitLab's PyPI solution seems to operate at the individual package level (meaning we'd have to list 100+ URLs, one for each package). This doesn't seem practical, but maybe I'm misunderstanding something (I can see the package registry menu under our group as well, but the docs point to the "package-pypi" docs).

We can't be the first small company that has faced this issue..? Is there a better way than registering dummy versions of all our packages on pypi.org (with version=0.0.1, so the s3pypi version will be preferred)?

thebjorn
  • It might be too late for this, but I would recommend prefixing your private package names, e.g. `yourcompanyname-packagename`. – Dustin Ingram Aug 10 '20 at 17:37
  • @DustinIngram yes, it's a little late :-) We do prefix a little over half our packages, although with a shorter prefix so the installable and importable names are the same - since it makes our code that works with packages as data (cross package dependency analysis etc.) much easier. – thebjorn Aug 10 '20 at 18:40
  • You could use a `requirements.txt` file and then specify the index for each package, as in https://stackoverflow.com/a/61784078/5666087. – jkr Aug 11 '20 at 14:54
  • It might not be the solution for you, but I tell what we do. 1) Prefix the package names, and using namespaces (eg. `company.product.tool`). 2) When we install our packages (including their in-house dependencies), we use a `requirements.txt` file including our PyPI URL. We run everything in container(s) and we install all public dependencies in them when we are building the images. – Balázs Aug 11 '20 at 15:12
  • @jakub the problem with adding `--index-url` declarations to `requirements.txt` is that the specified url will be the only one used for lookup, i.e. if my package foo has PIL in setup.py/install_requires pip will try to look up PIL in my private repo and then fail... – thebjorn Aug 11 '20 at 15:12
  • @Balázs we do most of those, most of the time ;-) We have a 2-letter prefix for many packages. We don't have one overarching namespace. We use `-e packagedir` in our `requirements.txt` files to make it easier for developers to work on cross-package issues, but we rewrite the `requirements.txt` file to install wheels at the start of our pipelines. Our pipelines are run in containers on our k8s cluster hosted on Google's cloud, and we do install many external requirements in the containers, but not all (too many, and it makes it difficult for devs to test upgrades to externals). – thebjorn Aug 11 '20 at 15:26
  • @Balázs ...and even if it perhaps isn't the perfect solution for my exact situation, your solution sounds like it could be useful for others. If you write it up as an answer I'll at least give you an upvote :-) – thebjorn Aug 11 '20 at 15:32
  • @thebjorn Thanks, I added it as an answer. – Balázs Aug 12 '20 at 19:35
  • "Is there a better way than to register dummy versions of all our packages on pypi.org"-- i think this might actually be considered name squatting and not allowed on PyPI, although I guess it depends on the context – Chris_Rands Aug 13 '20 at 17:35
  • @Chris_Rands so... is there a better solution? – thebjorn Aug 13 '20 at 23:30
  • We've solved similar issues in my last two jobs with both Artifactory and Sonatype, creating a proxy package repository for the public ones, a private package repository for the internal stuff, and then exposing a virtual package repository that aggregates those two. We upload our packages to the private one, and always query/install them from the virtual one. You request a package, it is looked for in the private repository first, and if it is not found there the proxy tries the public PyPI. – Jacobo de Vera Aug 18 '20 at 10:05
  • I guess you don't, but by chance: do you pin all the requirements (or at least your private ones)? – sinoroc Aug 18 '20 at 10:58
  • @sinoroc we pin our external requirements, not our private requirements (that would make cross-package debugging incredibly difficult). – thebjorn Aug 19 '20 at 00:39
  • I like the idea you suggest at the end "create a dummy version on pypi". Easy to automate and probably stable in time, but I would love to see a proper solution to this problem. – cglacet Aug 29 '20 at 15:16

5 Answers


It might not be the solution for you, but I'll tell you what we do.

  1. Prefix the package names and use namespaces (e.g. company.product.tool; see the sketch after this list).
  2. When we install our packages (including their in-house dependencies), we use a requirements.txt file that includes our PyPI URL. We run everything in containers, and we install all public dependencies into them when we build the images.
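
As a concrete illustration of point 1, here is a minimal sketch of a namespaced package, assuming PEP 420 implicit namespace packages and a reasonably modern setuptools (all names hypothetical):

# setup.py -- note: no __init__.py in the top-level "company/" directory,
# so several distributions can share the "company.*" namespace.
from setuptools import setup, find_namespace_packages

setup(
    name="company-product-tool",  # hypothetical distribution name
    version="1.0.0",
    packages=find_namespace_packages(include=["company.*"]),
)

With the distribution name prefixed like this, a shadowing package on pypi.org would have to claim your company prefix explicitly, which is much less likely to happen by accident.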
Balázs
  • _Namespaces are one honking great idea -- let's do more of those!_ - The Zen of Python, by Tim Peters `python -c 'import this'` – ti7 Aug 14 '20 at 15:20

We use VCS for this. I see you've explicitly ruled that out, but have you considered using branches to mark your latest stable builds in VCS?

If you aren't interested in the latest version of master or the dev branch, but you are running test/QA against commits, then I would configure your test/QA suite to merge into a branch named something like "stable" or "pypi-stable" and then your requirements files look like this:

pip install git+https://gitlab.com/yourorg/yourpackage.git@pypi-stable

The same configuration will work for setup.py requirements blocks (which allows for chained internal dependencies).
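
For the setup.py side, here is a minimal sketch using a PEP 508 direct reference (supported by pip 18.1+; names and URLs are hypothetical):

# setup.py of an internal package that depends on another internal
# package, pinned to the merge-on-green "pypi-stable" branch:
from setuptools import setup

setup(
    name="yourpackage",
    version="1.0.0",
    install_requires=[
        "yourdep @ git+https://gitlab.com/yourorg/yourdep.git@pypi-stable",
    ],
)

(Note that pypi.org rejects uploads whose metadata contains direct references like this, but for purely internal packages that is not a problem.)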

Am I missing something?

kerasbaz
  • Ah, the "use _this_ one" strategy. I like the simplicity of it. I hadn't thought of using a branch as a depository like that. Hmm... I guess we could use a similar strategy and simply copy the wheel file to s3 (instead of uploading it to a pypi-like structure) and use `https://s3.bucket/dev-wheels/foo.whl` in the requirements.txt file... (probably faster to merge to a branch on "upload" but maybe faster to download+install a wheel than doing a clone - although we'd probably need to pass `--no-cache-dir` to get around the wheel cache...) – thebjorn Aug 16 '20 at 20:49

Your company could redirect all requests to pypi.org to a service you control first (perhaps just via your build servers' hosts file(s)); a minimal sketch of such a service follows the list below.

This would potentially allow you to

  • prefer/override arbitrary packages with local ones
  • detect such cases
  • cache common/large upstream packages locally
  • reject suspect/non-known versions/names of upstream packages
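
Here is a minimal sketch of such a dispatching service, assuming you maintain the set of private names yourself (all names, ports and URLs are hypothetical, and a real deployment would need TLS, auth, caching, etc.):

# dispatch.py -- route /simple/<project>/ to the private index if the
# project is one of ours, otherwise redirect to the public PyPI.
from http.server import BaseHTTPRequestHandler, HTTPServer

PRIVATE_NAMES = {"ourpkg-core", "ourpkg-utils"}              # maintained list
PRIVATE_INDEX = "https://my-bucket.s3.amazonaws.com/simple"  # hypothetical
PUBLIC_INDEX = "https://pypi.org/simple"

class Dispatcher(BaseHTTPRequestHandler):
    def do_GET(self):
        # The simple-index URL layout is /simple/<project>/
        parts = [p for p in self.path.split("/") if p]
        project = parts[-1] if parts else ""
        base = PRIVATE_INDEX if project in PRIVATE_NAMES else PUBLIC_INDEX
        self.send_response(302)
        self.send_header("Location", "{}/{}/".format(base, project))
        self.end_headers()

HTTPServer(("", 8000), Dispatcher).serve_forever()

You would then point pip's --index-url at this service and drop the extra index entirely, so private names are never resolved against pypi.org.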
ti7
  • That's what we're trying to do. The `s3pypi` package/tool creates a pypi index in an S3 bucket that you control, and we specify the url of this bucket in the `PIP_EXTRA_INDEX_URL` environment variable so pip knows about it. `pip` will pick the version in our bucket as long as it is either (i) uniquely named, or (ii) has a version number that is higher than the version in the official pypi (neither of which can be relied upon). I would much prefer a solution that doesn't involve writing our own version of pypi ;-) – thebjorn Aug 15 '20 at 04:54
  • @thebjorn I think what ti7 meant was that you could try to configure your server's DNS settings such that it resolves `pypi.org` to `your-bucket.url` (e.g. via `ALIAS`) and then set `PIP_EXTRA_INDEX_URL=pypi-orig.org` which in turn gets resolved to the original `pypi.org`. This is just a sketch though and I don't know if it's possible with S3. Or you could have it point to a custom service which manages the dispatching to pypi and your private index. – a_guest Aug 16 '20 at 20:15
  • @a_guest we haven't tried messing with the DNS records, but just reversing which site is considered the official/extra index (can be done with env-vars) does not work. pip will still look for the highest version number if a package is found in both places. – thebjorn Aug 16 '20 at 20:32
  • @thebjorn I see. But at least you could redirect `pypi.org` to a custom service which dispatches http requests to the original pypi and your private index, based on the names of distributions, and then just leave out `PIP_EXTRA_INDEX_URL`. – a_guest Aug 16 '20 at 21:45
  • @a_guest sure, then I just need to write the custom service, and update its knowledge about which packages are ours when we create a new one (and allocate server space for it, attach monitoring and alerting, ..and all the other "stuff" that goes with a production server). We're a small company, so I was hoping for something that was more turn-key - we can't be the first small company that has these problems..? – thebjorn Aug 17 '20 at 00:59
  • @thebjorn You're probably not, but as others have suggested, prefixing package names is a way to prevent that problem. I understand that in the current situation this is not an option for you, but as a quick way out, what if you increment the version numbers of all your packages by, say, 1000 (or any other large number)? So if your package A has version `12.1.3` right now, just release a new version `1012.1.3` and you should be safe for the next millennium or so. – a_guest Aug 17 '20 at 12:02
  • @a_guest upping the version is the stop-gap fix that we've implemented (not by 1000 since we want to continue using sensible version numbers, and in the hope that we can find a better solution in the near'ish future). – thebjorn Aug 17 '20 at 15:06
  • @thebjorn One more thought on this, incrementing the version number even by 1000 (or a smaller number as you say), is not completely safe because a newly registered project on pypi might use a [date-based versioning scheme](https://packaging.python.org/guides/distributing-packages-using-setuptools/#date-based-versioning) and hence use a version number like `2020.8` which will most likely compare greater than any semantic versioning number. – a_guest Aug 17 '20 at 15:12

You could perhaps get the behavior you are looking for from a requirements.txt and two pip calls:

cat requirements.txt | xargs -n 1 pip install -i <your-s3pypi>
pip install -r requirements.txt

The first one tries to install what it can from your local repository and ignores a package if it fails. The second call then installs everything that failed before from PyPI.

This works because --upgrade-strategy only-if-needed is the default (as of pip 10.x, I believe; don't quote me on that). If you are using an old pip you may have to specify this manually.
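
The same two-phase idea as a small script, in case you want it in your pipeline (the index URL is hypothetical, and requirement lines carrying pip options such as -e are not handled by this sketch):

# two_phase_install.py
import subprocess
import sys

PRIVATE_INDEX = "https://my-bucket.s3.amazonaws.com/simple"  # hypothetical

reqs = []
with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith("#"):
            reqs.append(line)

# Phase 1: try each requirement against the private index only;
# a non-zero exit status (package not found there) is ignored.
for req in reqs:
    subprocess.run([sys.executable, "-m", "pip", "install",
                    "--index-url", PRIVATE_INDEX, req])

# Phase 2: everything still missing comes from pypi.org; requirements
# already satisfied in phase 1 are skipped (only-if-needed default).
subprocess.run([sys.executable, "-m", "pip", "install",
                "-r", "requirements.txt"], check=True)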


A limitation of this approach: if you expect/request a local package but it doesn't exist, and a package with the same name exists on PyPI, you will get that package instead. Not sure if that is a concern.

FirefoxMetzger
  • What if a private distribution depends on a public one which is hosted on pypi? Then it won't be installed during the first command since it cannot resolve that dependency and it will get installed during the second command where the same problem with conflicting names occurs. – a_guest Aug 17 '20 at 15:26
  • @a_guest In this case, you would **really** want to go with namespaces. You could write a script that installs a distribution via `pip install --no-deps -i <your-index> <dist>`. If it fails, it adds the distribution to the list to be installed from PyPI; if it succeeds, it recurses into the distribution's requirements and continues. You will get problems with partial installs though if you can't resolve a package, have circular dependencies, can't satisfy a version, ... . At this point it feels like you are just shy of running your own dependency manager. Namespaces seem much easier... – FirefoxMetzger Aug 17 '20 at 20:23
  • *"if it succeeds it recurses into the distribution's requirements and continues"* How would you realize this? As far as I know `pip` doesn't have a `--only-deps` option and even if it did, you would likely need the original pypi index as well; but in that case name conflicts are still a problem. The thing is that `pip` gives equal priority to all indices, but that's not always desirable. – a_guest Aug 17 '20 at 21:28
  • @a_guest there is `pip check`, which checks if you have missing or conflicting dependencies. It prints a text for each missing dep; you can either parse that or look at the source and call the internal `pip._internal.operations.check.check_package_set`, which returns a list of `missing` and `conflicting` dependencies which you can then iterate over and resolve. You will have to provide the logic for that though. – FirefoxMetzger Aug 18 '20 at 07:15
  • Honestly, that sounds like it's easier to roll your own fork of pip which implements a search order for package indices. Though it seems workable indeed. I think you should add that to your answer, because as it stands it won't solve the OP's problem. – a_guest Aug 18 '20 at 09:24

The comment from @a_guest on my first answer got me thinking, and the "problem" is that pip doesn't consider where the package originated when it sorts through candidates to satisfy requirements.

So here is a possible way to change this: Monkey-patch pip and introduce a preference over indexes.

from __future__ import absolute_import
import os
import sys

import pip
from pip._internal.index.package_finder import CandidateEvaluator


class MyCandidateEvaluator(CandidateEvaluator):
    def _sort_key(self, candidate):
        # Unpack the default sort key (tuple layout as of pip 20.x).
        (has_allowed_hash, yank_value, binary_preference, version,
         build_tag, pri) = super()._sort_key(candidate)

        priority_index = "localhost"  # use your s3pypi host here
        # comes_from may be None for some candidates, hence the str().
        if priority_index in str(candidate.link.comes_from or ""):
            priority = 1
        else:
            priority = 0

        # Rank index priority above the version, so a local candidate
        # beats a higher-versioned one from pypi.org.
        return (has_allowed_hash, yank_value, binary_preference, priority,
                version, build_tag, pri)


pip._internal.index.package_finder.CandidateEvaluator = MyCandidateEvaluator
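# Everything below is copied from pip's own __main__.py, so this script
# can be invoked exactly like pip itself.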

# Remove '' and current working directory from the first entry
# of sys.path, if present to avoid using current directory
# in pip commands check, freeze, install, list and show,
# when invoked as python -m pip <command>
if sys.path[0] in ('', os.getcwd()):
    sys.path.pop(0)

# If we are running from a wheel, add the wheel to sys.path
# This allows the usage python pip-*.whl/pip install pip-*.whl
if __package__ == '':
    # __file__ is pip-*.whl/pip/__main__.py
    # first dirname call strips of '/__main__.py', second strips off '/pip'
    # Resulting path is the name of the wheel itself
    # Add that to sys.path so we can import pip
    path = os.path.dirname(os.path.dirname(__file__))
    sys.path.insert(0, path)

from pip._internal.cli.main import main as _main  # isort:skip # noqa


if __name__ == '__main__':
    sys.exit(_main())

Set up a requirements.txt:

numpy
sampleproject

and call the above script using the same parameters you'd use for pip:

>python mypip.py install --no-cache --extra-index http://localhost:8000 -r requirements.txt
Looking in indexes: https://pypi.org/simple, http://localhost:8000
Collecting numpy
  Downloading numpy-1.19.1-cp37-cp37m-win_amd64.whl (12.9 MB)
     |████████████████████████████████| 12.9 MB 6.8 MB/s
Collecting sampleproject
  Downloading http://localhost:8000/sampleproject/sampleproject-0.5.0-py2.py3-none-any.whl (4.3 kB)
Collecting peppercorn
  Downloading peppercorn-0.6-py3-none-any.whl (4.8 kB)
Installing collected packages: numpy, peppercorn, sampleproject
Successfully installed numpy-1.19.1 peppercorn-0.6 sampleproject-0.5.0

Compare this to the default pip call

>pip install --no-cache --extra-index http://localhost:8000 -r requirements.txt
Looking in indexes: https://pypi.org/simple, http://localhost:8000
Collecting numpy
  Downloading numpy-1.19.1-cp37-cp37m-win_amd64.whl (12.9 MB)
     |████████████████████████████████| 12.9 MB 6.4 MB/s
Collecting sampleproject
  Downloading sampleproject-2.0.0-py3-none-any.whl (4.2 kB)
Collecting peppercorn
  Downloading peppercorn-0.6-py3-none-any.whl (4.8 kB)
Installing collected packages: numpy, peppercorn, sampleproject
Successfully installed numpy-1.19.1 peppercorn-0.6 sampleproject-2.0.0

Notice that mypip prefers a package if it can be retrieved from localhost; of course, you can customize this behavior further.

FirefoxMetzger
  • This relies on pip's internal (private) API, so it's not a stable solution. You would have to control every update of pip in order to make sure that the relevant parts are still in place. – a_guest Aug 18 '20 at 09:27
  • This is obviously not the "right" answer for almost all situations, but I like the creativity and the "if everything else fails, use a bigger hammer" approach - you get the bounty. A special thanks to @a_guest for the motivation ;-) – thebjorn Aug 19 '20 at 00:52
  • Seems very sound in its ideals; maybe craft a [PR to upstream](https://github.com/pypa/pip)! If another hack is OK, one could also use [`inspect` to make sure the text of `CandidateEvaluator` hasn't changed](https://docs.python.org/3/library/inspect.html#inspect.getsource) – ti7 Aug 19 '20 at 15:41