
I have a machine with JupyterHub (Python 2, Python 3, R and Bash kernels). I have Spark (Scala) and of course PySpark working. I can even use PySpark inside an interactive IPython notebook with a command like:

IPYTHON_OPTS="notebook" $path/to/bin/pyspark

(this opens a Jupyter notebook, and inside Python 2 I can use Spark)

BUT I can't get PySpark working inside JupyterHub.

The Spark kernel is more than what I really need.

I only need PySpark inside JupyterHub. Any suggestions?

Thanks.

lmtx
arj

4 Answers


You need to configure the PySpark kernel.

On my server, Jupyter kernels are located at:

/usr/local/share/jupyter/kernels/

You can create a new kernel by making a new directory:

mkdir /usr/local/share/jupyter/kernels/pyspark

Then create the kernel.json file - I paste mine as a reference:

{
 "display_name": "pySpark (Spark 1.6.0)",
 "language": "python",
 "argv": [
  "/usr/local/bin/python2.7",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "PYSPARK_PYTHON": "/usr/local/bin/python2.7",
  "SPARK_HOME": "/usr/lib/spark",
  "PYTHONPATH": "/usr/lib/spark/python/lib/py4j-0.9-src.zip:/usr/lib/spark/python/",
  "PYTHONSTARTUP": "/usr/lib/spark/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
 }
}

Adjust the paths and Python versions, and your PySpark kernel is good to go.
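
If the kernel starts correctly, the PYTHONSTARTUP script (shell.py) creates the SparkContext for you, so the first cell of a notebook can use sc directly. A minimal smoke test, assuming the kernel above, could be:

# sc is created by shell.py via PYTHONSTARTUP, so no setup is needed in the notebook
sc.parallelize(range(100)).sum()  # returns 4950 if the context is up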

lmtx

You could start Jupyter as usual, and add the following to the top of your code:

import sys
sys.path.insert(0, '<path>/spark/python/')
sys.path.insert(0, '<path>/spark/python/lib/py4j-0.8.2.1-src.zip')
import pyspark
conf = pyspark.SparkConf().set<conf settings>
sc = pyspark.SparkContext(conf=conf)

and change the parts in angle brackets as appropriate for your setup.
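
For concreteness, a filled-in version of the snippet might look like this; the Spark location, py4j version, master and app name below are only assumptions and must match your installation:

import sys

# Hypothetical Spark installation paths - adjust for your machine
sys.path.insert(0, '/usr/lib/spark/python/')
sys.path.insert(0, '/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip')

import pyspark

# Example conf settings; master and app name are placeholders
conf = pyspark.SparkConf().setMaster('local[*]').setAppName('jupyterhub-notebook')
sc = pyspark.SparkContext(conf=conf)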

mdurant
  • I thought there was a way to let JupyterHub initialize the SparkContext transparently, as PySpark does. Maybe the solution is calling a Python kernel with some more argv. – arj Jul 22 '15 at 16:35
  • Same result by exporting PYTHONPATH and disabling the security control in mediator.py. – arj Aug 02 '15 at 14:08
  • Could you be more specific? Is it possible to put this in a `profile`? – nanounanue Aug 24 '15 at 17:41

I didn't try it with JupyterHub, but this approach helped me with other tools (like Spyder).

I understand the Jupyter server is itself a Python script, so copy (or rename) jupyterhub to jupyterhub.py.

run:

spark-submit jupyterhub.py

(replace spark-submit and jupyterhub.py with the full paths to those files)

Ophir Yoktan
  • I think spark-submit is only for jar files. – arj Jul 22 '15 at 16:26
  • It's also for python scripts (at least in the newer versions) – Ophir Yoktan Jul 22 '15 at 16:27
  • I see. So in this way I run JupyterHub itself on a Spark cluster (local, standalone, Mesos or YARN), and opening a new Python notebook is supposed to load the SparkContext and the Spark API. Is that right? Oh, I see that bin/pyspark, after preparing the variables, executes `exec "$SPARK_HOME"/bin/spark-submit pyspark-shell-main "$@"`. – arj Jul 22 '15 at 16:44

I have created a public gist to configure Spark 2.x with JupyterHub and a CDH 5.13 cluster.

Abdul Mannan