
I have a machine with JupyterHub (Python 2, Python 3, R and Bash kernels). I have Spark (Scala) and of course PySpark working. I can even use PySpark inside an interactive IPython notebook with a command like:

IPYTHON_OPTS="notebook" $path/to/bin/pyspark

(this opens a Jupyter notebook, and inside Python 2 I can use Spark)

BUT I can't get PySpark working inside JupyterHub.

The Spark kernel is more than what I really need.

I only need PySpark inside JupyterHub. Any suggestions?

Thanks.

lmtx
arj

4 Answers


You need to configure the PySpark kernel.

On my server, Jupyter kernels are located at:

/usr/local/share/jupyter/kernels/

You can create a new kernel by making a new directory:

mkdir /usr/local/share/jupyter/kernels/pyspark

Then create the kernel.json file - I paste mine as a reference:

{
 "display_name": "pySpark (Spark 1.6.0)",
 "language": "python",
 "argv": [
  "/usr/local/bin/python2.7",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "PYSPARK_PYTHON": "/usr/local/bin/python2.7",
  "SPARK_HOME": "/usr/lib/spark",
  "PYTHONPATH": "/usr/lib/spark/python/lib/py4j-0.9-src.zip:/usr/lib/spark/python/",
  "PYTHONSTARTUP": "/usr/lib/spark/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
 }
}

Adjust the paths and Python versions, and your PySpark kernel is good to go.
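
If the kernel starts correctly, the PYTHONSTARTUP script (shell.py) creates the SparkContext for you, so the first cell of a notebook can use sc directly. A minimal smoke test, assuming the kernel above, could be:

# sc is created by shell.py via PYTHONSTARTUP, so no setup is needed in the notebook
sc.parallelize(range(100)).sum()  # returns 4950 if the context is up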

lmtx

You could start Jupyter as usual, and add the following to the top of your code:

import sys
sys.path.insert(0, '<path>/spark/python/')
sys.path.insert(0, '<path>/spark/python/lib/py4j-0.8.2.1-src.zip')
import pyspark
conf = pyspark.SparkConf().set<conf settings>
sc = pyspark.SparkContext(conf=conf)

and change the parts in angle brackets as appropriate for your setup.
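
For concreteness, a filled-in version of the snippet might look like this; the Spark location, py4j version, master and app name below are only assumptions and must match your installation:

import sys

# Hypothetical Spark installation paths - adjust for your machine
sys.path.insert(0, '/usr/lib/spark/python/')
sys.path.insert(0, '/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip')

import pyspark

# Example conf settings; master and app name are placeholders
conf = pyspark.SparkConf().setMaster('local[*]').setAppName('jupyterhub-notebook')
sc = pyspark.SparkContext(conf=conf)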

mdurant
  • I thought there was a way to let JupyterHub initialize the SparkContext transparently, as PySpark does. Maybe the solution is calling a Python kernel with some more argv. – arj Jul 22 '15 at 16:35
  • Same result by exporting PYTHONPATH and disabling the security control in mediator.py. – arj Aug 02 '15 at 14:08
  • Could you be more specific? Is it possible to put this in a `profile`? – nanounanue Aug 24 '15 at 17:41

I didn't try it with JupyterHub, but this approach helped me with other tools (like Spyder).

I understand the Jupyter server is itself a Python script, so copy (or rename) jupyterhub to jupyterhub.py.

run:

spark-submit jupyterhub.py

(replace spark-submit and jupyterhub.py with the full paths to those files)

Ophir Yoktan
  • I think spark-submit is only for jar files. – arj Jul 22 '15 at 16:26
  • It's also for python scripts (at least in the newer versions) – Ophir Yoktan Jul 22 '15 at 16:27
  • I see. So in this way I run JupyterHub itself on a Spark cluster (local, standalone, Mesos or YARN), and opening a new Python notebook is supposed to load the SparkContext and the Spark API. Is that right? Oh, I see that bin/pyspark, after preparing the variables, executes `exec "$SPARK_HOME"/bin/spark-submit pyspark-shell-main "$@"`. – arj Jul 22 '15 at 16:44

I have created a public gist to configure Spark 2.x with JupyterHub and a CDH 5.13 cluster.

Abdul Mannan