Is there a way to connect Apache Toree to a remote spark cluster? I see the common command is
jupyter toree install --spark_home=/usr/local/bin/apache-spark/
How can I go about using spark on a remote server without having to install locally?
There is indeed a way of getting Toree to connect to a remote Spark cluster.
The easiest way I've discovered is to clone the existing Toree Scala/Python kernel, and create a new Toree Scala/Python Remote kernel. That way you can have the choice of running locally or remotely.
Steps:
Make a copy of the existing kernel. On my particular Toree install, the kernels were located at /usr/local/share/jupyter/kernels/, so I ran:
cp -pr /usr/local/share/jupyter/kernels/apache_toree_scala/ /usr/local/share/jupyter/kernels/apache_toree_scala_remote/
Edit the new kernel.json file in /usr/local/share/jupyter/kernels/apache_toree_scala_remote/ and add the requisite Spark options to the __TOREE_SPARK_OPTS__ variable. Technically, only --master <url> is required, but you can also add --num-executors, --executor-memory, etc. to the variable.
Restart Jupyter.
My kernel.json file looks like this:
{
  "display_name": "Toree - Scala Remote",
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala_remote/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "language": "scala",
  "env": {
    "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.9-src.zip",
    "SPARK_HOME": "/opt/spark",
    "DEFAULT_INTERPRETER": "Scala",
    "PYTHON_EXEC": "python",
    "__TOREE_OPTS__": "",
    "__TOREE_SPARK_OPTS__": "--master spark://192.168.0.255:7077 --deploy-mode client --num-executors 4 --executor-memory 4g --executor-cores 8 --packages com.databricks:spark-csv_2.10:1.4.0"
  }
}
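After restarting Jupyter, you can confirm the new kernel was picked up with a quick sanity check (the names come from the kernel directory names, so yours may differ):

    jupyter kernelspec list
    # Available kernels:
    #   apache_toree_scala          /usr/local/share/jupyter/kernels/apache_toree_scala
    #   apache_toree_scala_remote   /usr/local/share/jupyter/kernels/apache_toree_scala_remote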
This is a possible example with some intuitive details for ANY remote cluster install. My remote cluster is a Cloudera 5.9.2, so these steps are specific to it. (You can also use this example to install against non-Cloudera clusters with some smart edits.)
On OS X, to build the CDH version (skip if using a prebuilt distribution):
Go to https://github.com/Myllyenko/incubator-toree and clone the repo
Install Docker
Set up 'signing' - it's been some time since I set this up; you'll need to sign the builds below. TBD
Create a new git branch, then edit the .travis.yml, README.md, and build.sbt files to change 5.10.x to 5.9.2
Start Docker, cd into the directory where make release runs (the repo root), build with make release, wait, and sign the 3 builds
Copy the file ./dist/toree-pip/toree-0.2.0-spark-1.6.0-cdh5.9.2.tar.gz to your spark-shell machine that can reach your YARN-controlled Spark cluster
Merge, commit, etc. your changes back to your master repo if this will be mission critical (the build steps above are consolidated in the sketch below)
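A rough consolidation of the build steps above, assuming the repo's Makefile exposes the release target; the branch name and the sed one-liner are illustrative (check each edited file by hand):

    # clone the CDH-friendly fork and branch it
    git clone https://github.com/Myllyenko/incubator-toree.git
    cd incubator-toree
    git checkout -b cdh5.9.2

    # point the build at CDH 5.9.2 instead of 5.10.x
    # (or edit .travis.yml, README.md, and build.sbt by hand)
    sed -i '' 's/5\.10\.[0-9x]*/5.9.2/g' .travis.yml README.md build.sbt

    # with Docker running, build (then sign) the release artifacts
    make release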
Spark Machine Installs:
Warning: some steps may need to be done as root, as a last resort
Install pip / anaconda (see other docs)
Install Jupyter: sudo pip install jupyter
Install Toree: sudo pip install toree-0.2.0-spark-1.6.0-cdh5.9.2.tar.gz, or use the apache-toree distribution
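A quick check that the install landed (assuming the tarball registers the package under the name toree; adjust if your build uses a different name):

    pip show toree               # prints version and location if installed
    jupyter toree install --help # confirms the kernel installer is on your PATH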
Configure Toree to run with Jupyter (example):
Edit ~/.bash_profile and add the following:
# check, extend, and re-check your PATH
echo $PATH
PATH=$PATH:$HOME/bin
export PATH
echo $PATH
# point the environment at the CDH-installed Spark and Hadoop
export CDH_SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/spark
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
# put the Spark binaries on the PATH
PATH=$PATH:$SPARK_HOME/bin
export PATH
echo $PATH
# build comma-separated lists for spark-submit's --packages and --jars
# (xargs echo joins the heredoc lines; sed swaps spaces for commas)
export SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
com.databricks:spark-csv_2.10:1.5.0
END
)
export SPARK_JARS=$(cat << END | xargs echo | sed 's/ /,/g'
/home/mymachine/extras/someapp.jar
/home/mymachine/extras/jsoup-1.10.3.jar
END
)
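A quick way to confirm what the heredoc trick builds, once the profile is loaded:

    echo $SPARK_JARS
    # /home/mymachine/extras/someapp.jar,/home/mymachine/extras/jsoup-1.10.3.jar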
export TOREE_JAR="/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.2.0-spark-1.6.0-cdh5.9.2-incubating.jar"
export SPARK_OPTS="--master yarn-client --conf spark.yarn.config.gatewayPath=/opt/cloudera/parcels --conf spark.scheduler.mode=FAIR --conf spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop --conf spark.yarn.historyServer.address=http://yourCDHcluster.net:18088 --conf spark.default.parallelism=20 --conf spark.driver.maxResultSize=1g --conf spark.driver.memory=1g --conf spark.executor.cores=4 --conf spark.executor.instances=5 --conf spark.executor.memory=1g --packages $SPARK_PKGS --jars $SPARK_JARS"
function jti() {
jupyter toree install \
--replace \
--user \
--kernel_name="CDH 5.9.2 Toree" \
--debug \
--spark_home=${SPARK_HOME} \
--spark_opts="$SPARK_OPTS" \
--log-level=0
}
function jn() {
jupyter notebook --ip=127.0.0.1 --port=8888 --debug --log-level=0
}
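After saving ~/.bash_profile, reload it so the exports and the jti / jn functions are available in your current shell:

    source ~/.bash_profile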
If you want a different port to reach Jupyter / Toree - now is your chance to edit 8888
Log out of your Toree / spark-shell machine
ssh back to that machine: ssh -L 8888:localhost:8888 toreebox.cdhcluster.net (assuming 8888 is the port in the bash file)
I expect that as a user (not root) you can type jti to install Toree into Jupyter. (Note: understanding this step may help with installing other kernels into Jupyter. Sidebar: @jamcom mentioned the produced kernel.json file; this step produces that file automatically, buried in your home directory's tree since you installed as a user rather than root.)
As the user, type jn to start a Jupyter Notebook. Wait a few seconds until the browser URL is printed, then paste that URL into your browser.
You now have Jupyter running, so pick a new CDH 5.9.2 Toree notebook (or whichever version you installed); this launches a new browser window. Since you have some Toree experience, run something like sc.getConf.getAll.sortWith(_._1 < _._1).foreach(println) in order to get the lazily instantiated Spark context going. Be really patient as your job is submitted to the cluster; you may have to wait a long time if your cluster is busy, or a little while for your job to process.
Tips and Tricks:
I ran into an issue on the first run, and the subsequent runs never saw that issue. (The issue might be fixed in the GitHub repo.)
Sometimes, I have to kill the old 'Apache Toree' app on YARN to start a new Toree.
Sometimes, my VM has an orphaned JVM. If you get memory errors starting a Jupyter Notebook / Toree, or if you were unexpectedly disconnected, check your process list with top and kill the extra JVM (be careful identifying your lost process).
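For the two tips above, these one-liners may help; the grep filters are assumptions, so use yarn application -list to find the real application name and ID on your cluster:

    # find and kill a stale Toree application on YARN
    yarn application -list | grep -i toree
    yarn application -kill <application_id>

    # spot an orphaned JVM on the Toree machine before killing it
    ps aux | grep -i toree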