I have two libraries I want to import in my code: pandas and utils (a library I wrote myself). While testing, I found that pandas does not work, whereas boto3 and requests (which are not preinstalled on the cluster) work fine when packaged into two zip files:
- libs.zip: contains boto3 and requests
- dependencies.zip: contains utils
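For reference, this is roughly how I build the two archives (a minimal sketch; the staging directory name is illustrative):

# Install the pure-Python packages into a staging directory
pip install -t ./staging boto3 requests
# Zip them so the packages sit at the root of the archive
cd staging && zip -r ../libs.zip . && cd ..
# Package my own library the same way
zip -r dependencies.zip utils/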
So I tried to import pandas the same way: using a requirements file, I created a zip with pandas and all of its dependencies. I've tried importing the zip file within the code, like:
sc.addPyFile("libs.zip")
and the spark-submit command looks like this:
spark-submit --deploy-mode client --py-files s3://${BUCKET_NAME}/libs.zip s3://${BUCKET_NAME}/main.py
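(For completeness: --py-files accepts a comma-separated list, so both archives can be shipped at once; a sketch:)

spark-submit --deploy-mode client --py-files s3://${BUCKET_NAME}/libs.zip,s3://${BUCKET_NAME}/dependencies.zip s3://${BUCKET_NAME}/main.py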
I have tried many variations of submitting the Spark job to the EMR cluster, but I have no idea how to fix this error:
Traceback (most recent call last):
File "/mnt/tmp/spark-xxxx/main.py", line 20, in <module>
import pandas as pd
File "/mnt/tmp/spark-xxxx/userFiles-xxxx/libs.zip/pandas/__init__.py", line 17, in <module>
ImportError: Unable to import required dependencies:
numpy:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
Please note and check the following:
* The Python version is: Python3.7 from "/usr/bin/python3"
* The NumPy version is: "1.19.4"
and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.
Original error was: No module named 'numpy.core._multiarray_umath'
How can I import pandas and another library (created by me) with spark-submit?