
I'm trying to figure out how to use external libraries with Spark. I have a program that runs successfully, and I am now trying to import external libraries. I'm using virtualenv, and every time I submit the job, Spark complains that it cannot find the files.

Here is one of many submit commands I have tried:

/path/to/spark-1.1.0-bin-hadoop2.4/bin/spark-submit ua_analysis.py --py-files `pwd`/venv/lib/python2.7/site-packages

I have tried adding the files individually with the --py-files flag, and I've also tried the following subdirectories:

venv/lib
venv/python2.7
venv/lib/python2.7/site-packages/<package_name>

All of these produce the following error:

ImportError: ('No module named <module>', <function subimport at 0x7f287255dc80>, (<module>,))

    org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
    org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
    org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
....

I've also tried copying these files to the pyspark directory, with no success.
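
For reference, if I understand the spark-submit usage correctly, --py-files has to come before the application script (anything after the script is treated as an argument to the application itself) and expects a comma-separated list of .zip, .egg, or .py files rather than a directory. So I suspect the invocation should look roughly like this, where deps.zip is just a placeholder name for an archive of the site-packages contents:

cd venv/lib/python2.7/site-packages
zip -r /path/to/deps.zip .
/path/to/spark-1.1.0-bin-hadoop2.4/bin/spark-submit --py-files /path/to/deps.zip ua_analysis.py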


2 Answers


When you create the virtualenv, pass the --system-site-packages option to virtualenv:

virtualenv --system-site-packages venv

If you forgot to pass the option:

rm venv/lib/python2.7/no-global-site-packages.txt

Either way, you will be able to import the system site-packages from inside the virtualenv.
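
As a quick sanity check (numpy here is just an example of a package that is installed system-wide but not inside the venv), you can confirm the system packages are visible from inside the virtualenv:

source venv/bin/activate
python -c "import numpy; print(numpy.__file__)"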

kev

Not sure the answer above is still valid; in my case I had to change

include-system-site-packages = false to include-system-site-packages = true

in the pyvenv.cfg file located inside my specific virtualenv folder (i.e. 'virtaulenv_number_1'). Now I can use libraries that are not present in my virtualenv but are present in the system-wide Python installation.
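
For illustration, the virtaulenv_number_1/pyvenv.cfg file ends up looking roughly like this after the edit (the home and version values are examples from my machine and will differ for you):

home = /usr/bin
include-system-site-packages = true
version = 3.8.10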

pippo1980