
I have two libraries that I want to import in my code: Pandas and utils (my own library). While testing, I found that Pandas does not work either.

Using boto3 and requests (which are not preinstalled on the cluster) works when I create two zip files, built roughly as sketched below:

  • libs.zip: with boto3 and requests
  • dependencies.zip: utils
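Roughly, the zips are built like this (a simplified sketch; requirements.txt and the build directory name are placeholders, the exact commands may differ):

# install the third-party packages (boto3, requests, ...) into a build directory
pip install -r requirements.txt -t ./build_libs

# zip the installed packages so they sit at the root of the archive
cd build_libs && zip -r ../libs.zip . && cd ..

# package my own utils library the same way
zip -r dependencies.zip utils/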

So I package Pandas using a requirements file, creating a zip that contains Pandas and all of its dependencies. I've tried importing the zip file within the code, like:

sc.addPyFile("libs.zip")

and the spark-submit command looks like:

spark-submit --deploy-mode client --py-files s3://${BUCKET_NAME}/libs.zip s3://${BUCKET_NAME}/main.py
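For completeness, --py-files accepts a comma-separated list, so both archives can be passed in one submit (the dependencies.zip path is assumed to live in the same bucket):

spark-submit --deploy-mode client \
  --py-files s3://${BUCKET_NAME}/libs.zip,s3://${BUCKET_NAME}/dependencies.zip \
  s3://${BUCKET_NAME}/main.py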

I have tried many ways to submit the Spark job to the EMR cluster, but I have no idea what is causing this error:

Traceback (most recent call last):
  File "/mnt/tmp/spark-xxxx/main.py", line 20, in <module>
    import pandas as pd
  File "/mnt/tmp/spark-xxxx/userFiles-xxxx/libs.zip/pandas/__init__.py", line 17, in <module>
ImportError: Unable to import required dependencies:
numpy:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.7 from "/usr/bin/python3"
  * The NumPy version is: "1.19.4"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: No module named 'numpy.core._multiarray_umath'

How can I import Pandas and another library (created by me) with spark-submit?

  • As you are working on EMR, you can leverage bootstrap actions to install all the dependencies. See one of my post https://stackoverflow.com/questions/36217090/how-do-i-get-python-libraries-in-pyspark/39360646#39360646 where I have explained how to install various libraries during bootstrap actions. – Hussain Bohra Dec 23 '20 at 16:25
  • @HussainBohra I would like to use as you said, but I need to run using the library within a zip. – Guilherme Ferreira Dec 23 '20 at 17:37
  • are you including numpy in your requirements? If I import numpy then type numpy.core._multiarray_umath in ipython I get <........virtualenvs\\code-qxrssbyv\\lib\\site-packages\\numpy\\core\\_multiarray_umath.cp38-win_amd64.pyd'> So I know that particular numpy module is available. – Jonathan Leon Dec 24 '20 at 03:09
  • @JonathanLeon Yes. I include the version 1.19.3 for tests. If I don't include numpy, then pip will install automatically the latest version 1.19.4. – Guilherme Ferreira Dec 28 '20 at 13:01
  • Does this answer your question? [pyspark addPyFile to add zip of .py files, but module still not found](https://stackoverflow.com/questions/51450462/pyspark-addpyfile-to-add-zip-of-py-files-but-module-still-not-found) – Gonçalo Peres May 29 '21 at 06:24
