13

I have a similar import error on Spark executors to the one described here, just with psycopg2: ImportError: No module named numpy on spark workers

Here it says "Although pandas is too complex to distribute as a *.py file, you can create an egg for it and its dependencies and send that to executors".

So the question is: how do I create an egg file from a package and its dependencies? Or a wheel, in case eggs are legacy. Is there a pip command for this?

Bunyk

2 Answers

6

You want to build a wheel. Wheels are newer and more robust than eggs, and they are supported by both Python 2 and 3.

For something as popular as numpy, you don't need to bother making the wheel yourself. The project publishes prebuilt wheels, so you can just download one. Many Python libraries ship a wheel as part of their distribution. See here: https://pypi.python.org/pypi/numpy
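
If you'd rather have pip pick the right wheel for your platform instead of guessing the file name by hand, pip download can fetch it for you. This is only a sketch, assuming a reasonably recent pip (8.1 or newer); the wheelhouse/ directory name is my own placeholder:

$ pip download numpy psycopg2 --only-binary=:all: -d wheelhouse/
$ # fetching for a different target platform than the one you're on:
$ pip download psycopg2 --only-binary=:all: \
    --platform manylinux1_x86_64 --implementation cp \
    --python-version 27 --abi cp27mu -d wheelhouse/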

If you're curious how to make one in general, see https://pip.pypa.io/en/stable/reference/pip_wheel/.
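
As a rough sketch of that: pip wheel builds (or reuses) wheels for a package and everything it depends on, so you end up with one directory you can ship. Again, wheelhouse/ is just a placeholder name:

$ pip install wheel
$ pip wheel psycopg2 -w wheelhouse/
$ # or, for everything a project needs:
$ pip wheel -r requirements.txt -w wheelhouse/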

Alternatively, you could just install numpy on your target workers.

EDIT:

After your comments, I think it's worth mentioning the pipdeptree utility. If you need to see what a package's pip dependencies are, this utility will list them for you. Here's an example:

$ pipdeptree
3to2==1.1.1
anaconda-navigator==1.2.1
ansible==2.2.1.0
  - jinja2 [required: <2.9, installed: 2.8]
    - MarkupSafe [required: Any, installed: 0.23]
  - paramiko [required: Any, installed: 2.1.1]
    - cryptography [required: >=1.1, installed: 1.4]
      - cffi [required: >=1.4.1, installed: 1.6.0]
        - pycparser [required: Any, installed: 2.14]
      - enum34 [required: Any, installed: 1.1.6]
      - idna [required: >=2.0, installed: 2.1]
      - ipaddress [required: Any, installed: 1.0.16]
      - pyasn1 [required: >=0.1.8, installed: 0.1.9]
      - setuptools [required: >=11.3, installed: 23.0.0]
      - six [required: >=1.4.1, installed: 1.10.0]
    - pyasn1 [required: >=0.1.7, installed: 0.1.9]
  - pycrypto [required: >=2.6, installed: 2.6.1]
  - PyYAML [required: Any, installed: 3.11]
  - setuptools [required: Any, installed: 23.0.0]

If you're using Pyspark and need to package your dependencies, pip can't do this for you automatically. Pyspark has its own dependency management that pip knows nothing about. The best you can do is list the dependencies and shove them over by hand, as far as I know.
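
For what it's worth, "shoving them over by hand" usually looks something like the sketch below. The job file name is a placeholder, and whether a binary wheel actually works through --py-files depends on your Spark version and the wheel's contents, so treat it as a starting point rather than a guarantee:

$ spark-submit \
    --py-files wheelhouse/psycopg2-2.7.3.1-cp27-cp27mu-manylinux1_x86_64.whl \
    my_job.py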

Additionally, Pyspark isn't dependent on numpy or psycopg2, so pip can't possibly tell you that you'd need them if all you're telling pip is your version of Pyspark. That dependency has been introduced by you, so you're responsible for giving it to Pyspark.

As a side note, we use bootstrap scripts that install our dependencies (like numpy) before we boot our clusters. It seems to work well. That way you list the libs you need once in a script, and then you can forget about it.
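
To make that concrete, such a bootstrap script boils down to something like this (a sketch; the package list and the need for sudo are assumptions about your cluster image):

#!/bin/bash
# runs on every node before the Spark cluster starts
set -e
sudo pip install numpy psycopg2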

HTH.

Matt Messersmith
  • Yes, saw that. But there are a lot of those wheels, and it is hard to choose the right one for my system. pip somehow chooses it for me. Is there any way to tell it to load it? – Bunyk Oct 10 '17 at 13:40
  • Did you try it with any of the existing wheels? Numpy compatibility between versions is pretty good. The interfaces are relatively stable. – Matt Messersmith Oct 10 '17 at 13:47
  • Yes, I tried with psycopg2-2.7.3.1-cp27-cp27mu-manylinux1_x86_64.whl and it worked for me. I'm just asking if there is a way to get that with pip, not with wget, because it is hard to figure out whether I need x86_64, or i686, or something else. pip somehow knows the exact package. – Bunyk Oct 10 '17 at 13:56
  • Hi @Bunyk, when you said that "psycopg2-2.7.3.1-cp27-cp27mu-manylinux1_x86_64.whl" worked, did you mean that you had to install it on all the nodes of the cluster (with a bootstrap script or pip), or were you able to ship it as --py-files or something similar directly on Spark? Thanks – pippobaudos Jul 15 '18 at 10:39
  • @pippobaudos Oh, that was almost a year ago, so I might remember it incorrectly, but I believe I shipped the wheel as a --py-files argument to pyspark. On the nodes themselves you could install it manually in any way you want: wheels, eggs, or building from source. – Bunyk Jul 15 '18 at 11:07
3

You can install wheel using pip install wheel.

Then create a .whl using python setup.py bdist_wheel. You'll find it in the dist directory inside the root directory of the Python package. You might also want to pass --universal if you want a single .whl file for both Python 2 and Python 3.
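
Put together, and assuming your package already has a setup.py (the package name and version below are made up), the whole thing is roughly:

$ pip install wheel
$ python setup.py bdist_wheel --universal
$ ls dist/
mypackage-0.1.0-py2.py3-none-any.whl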

More info on wheel.

ritiek