
I have a Python package with many modules built into an .egg file and I want to use it inside a Zeppelin notebook. According to the Zeppelin documentation, to pass this package to the Zeppelin Spark interpreter, you can export it through the --files option in SPARK_SUBMIT_OPTIONS in conf/zeppelin-env.sh. I have the following questions regarding this:

  1. In the pyspark shell, the .egg file given with --py-files works (i.e. I am able to import the module inside the package from the pyspark shell; see the sketch after this list), while the same .egg file given with --files does not work (ImportError: No module named XX.xx).

  2. Adding the .egg file via the --py-files option in SPARK_SUBMIT_OPTIONS in Zeppelin causes an error: Error: --py-files given but primary resource is not a Python script. As per my understanding, whatever is given in SPARK_SUBMIT_OPTIONS is passed to the spark-submit command, so why does --py-files throw an error?

  3. When I add the .egg file through the --files option in SPARK_SUBMIT_OPTIONS, the Zeppelin notebook does not throw an error, but I am not able to import the module inside the notebook.
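
For example, this is the behaviour from point 1 (a sketch; I'm assuming the top-level package inside the egg is fly_libs, as the egg name suggests):

# works: the egg is shipped and added to sys.path on the driver and executors
pyspark --py-files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg
# >>> import fly_libs          # succeeds

# does not work: --files only distributes the file, it is not added to sys.path
pyspark --files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg
# >>> import fly_libs          # ImportError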

What's the correct way to pass an .egg file to the Zeppelin Spark interpreter?

The Spark version is 1.6.2 and the Zeppelin version is 0.6.0.

The zeppelin-env.sh file contains the following:

export SPARK_HOME=/home/me/spark-1.6.1-bin-hadoop2.6
export SPARK_SUBMIT_OPTIONS="--jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar --files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg"
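
For context, Zeppelin prepends SPARK_SUBMIT_OPTIONS to the spark-submit command it uses to launch its interpreter process, so the resulting call presumably looks roughly like this (a sketch; the exact main class and interpreter JAR path depend on the Zeppelin installation):

spark-submit --jars /home/me/spark-csv-1.5.0-s_2.10.jar,/home/me/commons-csv-1.4.jar \
  --files /home/me/models/Churn-zeppelin/package/build/dist/fly_libs-1.1-py2.7.egg \
  --class org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer \
  <zeppelin-spark-interpreter>.jar <port>

Since the primary resource here is a JAR rather than a .py file, spark-submit rejects --py-files, which would explain the error in point 2.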
Meethu Mathew
  • Which versions of Spark/Zeppelin are you using? – Yaron Jan 31 '17 at 11:56
  • Can you post the whole command as well? – 1ambda Jan 31 '17 at 12:40
  • Spark version 1.6.2 and zeppelin 0.6.0 – Meethu Mathew Feb 01 '17 at 03:52
  • 2
    @MeethuMathew: Did you resolve this in the end? I see exactly the same as you with Zeppelin 0.7.0 pointing at Spark 2.1.0. I see you also raised the JIRA issue at https://issues.apache.org/jira/browse/ZEPPELIN-2136. Yes, the `sc.addPyFile()` works around it but it's not really practical for sharing multiple dependencies with other users of Zeppelin. Another workaround is to install your eggs in site-packages on all nodes and then use the `PYSPARK_PYTHON` variable in `zeppelin-env.sh` to give the path to python (including the executable itself). The path must be the same on all nodes. – snark Jan 04 '18 at 15:31
  • @snark the sc.addPyFile() – Meethu Mathew Feb 07 '18 at 09:56

1 Answer


If you have a Python dependency package spark_utils.zip that contains

src/hdfs_utils.py

you can use it in two ways:

sc.addPyFile

%pyspark

sc.addPyFile("/mnt/zeppelin/python-packages/spark_utils.zip")
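# addPyFile ships the archive to the driver and executors and adds it to sys.path,
# which is why the import below works; the path must be readable from the Zeppelin server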
from src.hdfs_utils import get_hdfs_paths, delete_path

spark.submit.pyFiles

Alternatively, configure spark.submit.pyFiles:

%spark.conf

# python venv
PYSPARK_PYTHON               /mnt/zeppelin/python-venv/bin/python
PYSPARK_DRIVER_PYTHON        /mnt/zeppelin/python-venv/bin/python

# dependency
spark.submit.pyFiles /mnt/zeppelin/python-packages/spark_utils.zip

Then you can use it like this:

%pyspark

from src.hdfs_utils import get_hdfs_paths, delete_path

The first approach loads the package dynamically, which is handy during development.

The second uses a package predefined in the config file, which is better suited to production.
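
Note that the %spark.conf generic configuration interpreter is, as far as I know, only available in newer Zeppelin releases (0.8 and later), so on the Zeppelin 0.6 / Spark 1.6 setup from the question the sc.addPyFile approach is the applicable one.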

geosmart