
I'm trying to build a multistage Docker image with some Python packages. For some reason, the pip wheel command still downloads source archives (.tar.gz) for a few packages even though .whl files exist on PyPI. For example, it does this for pandas and numpy.

Here is my requirements.txt:

# REST client
requests

# ETL
pandas

# SFTP
pysftp
paramiko

# LDAP
ldap3

# SMB
pysmb

First stage of the Dockerfile:

ARG IMAGE_TAG=3.7-alpine
FROM python:${IMAGE_TAG} as python-base
COPY ./requirements.txt /requirements.txt
RUN mkdir /wheels && \
    apk add build-base openssl-dev pkgconfig libffi-dev
RUN pip wheel --wheel-dir=/wheels --requirement /requirements.txt
ENTRYPOINT tail -f /dev/null

The output below shows that it downloads the source package for pandas but gets a wheel for requests. Also, surprisingly, it takes a very long time to download and build these packages.

Step 5/11 : RUN pip wheel --wheel-dir=/wheels --requirement /requirements.txt
 ---> Running in d7bd8b3bd471
Collecting requests (from -r /requirements.txt (line 4))
  Downloading https://files.pythonhosted.org/packages/51/bd/23c926cd341ea6b7dd0b2a00aba99ae0f828be89d72b2190f27c11d4b7fb/requests-2.22.0-py2.py3-none-any.whl (57kB)
  Saved /wheels/requests-2.22.0-py2.py3-none-any.whl
Collecting pandas (from -r /requirements.txt (line 7))
  Downloading https://files.pythonhosted.org/packages/0b/1f/8fca0e1b66a632b62cc1ae38e197befe48c5cee78f895edf4bf8d340454d/pandas-0.25.0.tar.gz (12.6MB)

I would like to know how I can force it to get a wheel file for all the required packages and also for their dependencies. I observed that some dependencies get a wheel file while others get source packages.

NOTE: code above is a combination of multiple online sources.

Any help to make this build process easier is greatly appreciated.

Thanks in Advance.

inblueswithu

2 Answers

  1. You are using Alpine Linux. It is somewhat unique in that it uses musl as the underlying libc implementation, as opposed to most other Linux distros, which use glibc.

  2. If a Python project implements C extensions (this is what e.g. numpy or pandas do), it has two options: either

    • offer a source dist (.tar.gz, .tar.bz2 or .zip) so that the C extensions are compiled using the C compiler/library found on the target system, or
    • offer a wheel that contains compiled C extensions. If the extensions are compiled against glibc, they will be unusable on systems using musl, and AFAIK vice versa too.

Now, Python defines the manylinux1 platform tag which is specified in PEP 513 and updated in PEP 571. Basically, the name says it all - wheels with compiled C extensions should be built against glibc and thus will work on many distros (that use glibc), but not on some (Alpine being one of them).
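
To make the distinction concrete, here is a minimal sketch that reads the platform tag out of a wheel file name and tells you whether the wheel is platform independent or built against glibc. The helper names (wheel_tags, libc_family) are made up for illustration, and the pandas file name is only an example of the naming scheme, not something taken from the build log above.

```python
def wheel_tags(filename):
    """Split a wheel file name into (python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename}")
    # The last three dash-separated fields are the compatibility tags.
    python_tag, abi_tag, platform_tag = filename[:-4].split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def libc_family(filename):
    """Return 'any', 'glibc' or 'other' for a wheel file name (hypothetical helper)."""
    platform_tag = wheel_tags(filename)[2]
    # A platform tag may hold several dot-separated alternatives.
    for tag in platform_tag.split("."):
        if tag == "any":
            return "any"      # pure Python, works on glibc and musl alike
        if tag.startswith("manylinux"):
            return "glibc"    # compiled against glibc, unusable on Alpine
    return "other"


print(libc_family("requests-2.22.0-py2.py3-none-any.whl"))            # any
print(libc_family("pandas-0.25.0-cp37-cp37m-manylinux1_x86_64.whl"))  # glibc
```

On Alpine, pip skips every wheel that falls into the 'glibc' bucket, which is why it ends up with the .tar.gz for pandas.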

For you, it means that you have two possibilities: either build packages from source dists (this is what pip already does), or install the prebuilt packages via Alpine's package manager. E.g. for py3-pandas it would mean doing:

# echo "@edge http://dl-cdn.alpinelinux.org/alpine/edge/testing" >> /etc/apk/repositories
# apk update
# apk add py3-pandas@edge

However, I don't see a big issue with building packages from source. When done right, you capture it in a separate layer placed as high as possible in the image, so it is cached and not rebuilt each time.


You might ask why there's no platform tag analogous to manylinux1 for musl-based distros. The reason is that no one has yet written a PEP similar to PEP 513 that defines a musllinux platform tag. If you are interested in the current state of it, take a look at issue #37.


Update

PEP 656, which defines a musllinux platform tag, is now accepted, so it (hopefully) won't be long until prebuilt wheels for Alpine start to ship. You can track the current implementation state in auditwheel#305.
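
As a rough sketch of how such a tag can be matched against a system: the tag has the form musllinux_{major}_{minor}_{arch}, and, as I read PEP 656, a wheel is usable when the architecture and the musl major version match and the wheel's minor version is not newer than the one installed. The function name and the way the musl version is passed in are assumptions made for this example; pip's real logic lives in the packaging library and covers more cases.

```python
import re

# musllinux_{major}_{minor}_{arch}, e.g. musllinux_1_1_x86_64
_MUSLLINUX = re.compile(r"^musllinux_(\d+)_(\d+)_(.+)$")


def musllinux_compatible(platform_tag, musl_version, arch):
    """Check a musllinux platform tag against a (major, minor) musl version.

    Hypothetical helper: the caller supplies the musl version and the
    architecture instead of having them detected from the running system.
    """
    match = _MUSLLINUX.match(platform_tag)
    if match is None:
        return False  # not a musllinux tag at all (e.g. manylinux1_x86_64)
    major, minor, tag_arch = int(match.group(1)), int(match.group(2)), match.group(3)
    sys_major, sys_minor = musl_version
    return tag_arch == arch and major == sys_major and minor <= sys_minor


print(musllinux_compatible("musllinux_1_1_x86_64", (1, 2), "x86_64"))  # True
print(musllinux_compatible("manylinux1_x86_64", (1, 2), "x86_64"))     # False
```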

hoefling
  • This clears things up a lot. Thanks. So the reason the `requests` package is downloaded as a wheel is that it is a pure Python package without any C bindings, i.e. it is independent of the platform (glibc or musl <--> debian or alpine). Correct? – inblueswithu Aug 14 '19 at 13:28
  • Exactly. You will recognize the platform independent wheels by the file name ending with `-any.whl`, while platform specific wheels specify the target platform with bitness, e.g. `-manylinux1_i686.whl` is for glibc-based 32-bit Linux, `-win_amd64.whl` is for 64-bit Windows etc. – hoefling Aug 14 '19 at 14:02

For Python 3, your packages will be installed from wheels with an ordinary pip call:

pip install pandas numpy

From the docs:

Pip prefers Wheels where they are available. To disable this, use the --no-binary flag for pip install.

If no satisfactory wheels are found, pip will default to finding source archives.
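
The opposite of --no-binary is --only-binary, which makes pip refuse source archives instead of falling back to them. Below is a small sketch that assembles such a pip download call for the requirements file from the question. The function name is invented for the example, and the paths are the ones from the question's Dockerfile. Note that on a platform with no matching wheels for a package (Alpine, in the question's setup) this command fails instead of silently building from source.

```python
import sys


def pip_download_wheels_cmd(requirements, dest):
    """Build a pip command line that accepts wheels only (hypothetical helper)."""
    return [
        sys.executable, "-m", "pip", "download",
        "--only-binary=:all:",   # never fall back to .tar.gz source archives
        "--dest", dest,          # directory the wheels are saved to
        "--requirement", requirements,
    ]


cmd = pip_download_wheels_cmd("/requirements.txt", "/wheels")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```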

ipaleka
  • That is what I expected it to do with `pip wheel` as well, but it does not. Also, I'm doing a multistage container build and trying to copy all the wheels to my second stage to make it small. I can include the second stage of the Dockerfile later if that helps. I can't use `pip install` because it still needs the C bindings even after I move the files to the final container; in short, it breaks the packages. – inblueswithu Aug 13 '19 at 22:38
  • That's a wrong expectation: `pip wheel` by definition should download the source and try to build a wheel from it; building wheels from wheels doesn't make sense. For your use case I suggest `pip download` to download those wheels. – ipaleka Aug 13 '19 at 22:49
  • I'll check that option and update my results. Thanks! – inblueswithu Aug 13 '19 at 23:17
  • If that is the case, then why was the *requests* package downloaded as a wheel while *pandas* wasn't? – inblueswithu Aug 13 '19 at 23:28
  • I really don't know, but the output is "Skipping requests, due to already being wheel." – ipaleka Aug 14 '19 at 09:21