I'm trying to figure out how to use external libraries with Spark. I have a program that runs successfully on Spark, and I'm now trying to import external libraries from a virtualenv, but every time I submit the job, Spark complains that it cannot find the file.
Here is one of many submit commands I have tried:
/path/to/spark-1.1.0-bin-hadoop2.4/bin/spark-submit ua_analysis.py --py-files `pwd`/venv/lib/python2.7/site-packages
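(For reference: spark-submit expects its options before the application script; anything after the script is passed to the script as its own arguments, so a --py-files placed there never reaches Spark. Also, --py-files takes a comma-separated list of .zip, .egg, or .py files rather than a bare directory. A minimal sketch of that ordering, with deps.zip as an assumed archive name:)

/path/to/spark-1.1.0-bin-hadoop2.4/bin/spark-submit --py-files `pwd`/deps.zip ua_analysis.py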
I have tried adding the files individually with the --py-files flag, and I have also tried pointing --py-files at the following subdirectories (a zip-based sketch follows the list):
venv/lib
venv/python2.7
venv/lib/python2.7/site-packages/<package_name>
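(Since --py-files accepts .zip archives, one sketch, using deps.zip as an assumed name, is to zip the contents of site-packages so the package directories sit at the archive root, then ship that single zip:)

cd venv/lib/python2.7/site-packages
zip -r ../../../../deps.zip .   # package dirs end up at the root of deps.zip
cd -
/path/to/spark-1.1.0-bin-hadoop2.4/bin/spark-submit --py-files `pwd`/deps.zip ua_analysis.py

(Note this only helps for pure-Python packages; compiled C extensions cannot be imported from a zip archive.)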
All of these produce the following error:
ImportError: ('No module named <module>', <function subimport at 0x7f287255dc80>, (<module>,))
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
....
I've also tried copying these files into the pyspark directory, without success.