For one reason or another, I want to install a version of Apache Spark different from the one available on Google Cloud Dataproc. How can I install a custom version of Spark but also maintain compatibility with the Cloud Dataproc tooling?
2 Answers
In general, you should be able to install a custom version of Spark on Dataproc and maintain compatibility with the Cloud Dataproc tooling (mainly Cloud Dataproc jobs).
To do this, you should:
- Install spark in /usr/local/lib/spark or /opt/spark instead of the user home directory
- Don't modify the user .bashrc
- Uninstall the Cloud Dataproc-provided version of spark using apt-get remove
- Symlink /usr/local/bin/spark-submit to the provided binary (this is needed for the Cloud Dataproc jobs API to work with the new Spark install)
- Re-use the /etc/spark/conf provided by Cloud Dataproc
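
For reference, a rough initialization-action sketch of those steps could look like the following. The Spark version, download URL, Debian package names, and the /opt/spark install location are assumptions here, not part of the original answer:

#!/bin/bash
# Hypothetical Dataproc initialization-action sketch of the steps above.
# The Spark version, download URL, package names, and /opt/spark install
# path are placeholders -- adjust them for your cluster.
set -euxo pipefail

SPARK_VERSION="2.4.4"
SPARK_TGZ="spark-${SPARK_VERSION}-bin-hadoop2.7.tgz"

# Uninstall the Dataproc-provided Spark (package names may differ by image version).
apt-get remove -y spark-core spark-python || true

# Install the custom Spark under /opt/spark rather than a user home directory.
wget -q "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_TGZ}" -P /tmp
mkdir -p /opt/spark
tar -xzf "/tmp/${SPARK_TGZ}" -C /opt/spark --strip-components=1

# Symlink spark-submit so the Cloud Dataproc jobs API finds the new install.
ln -sf /opt/spark/bin/spark-submit /usr/local/bin/spark-submit

# Re-use the Dataproc-provided configuration.
rm -rf /opt/spark/conf
ln -s /etc/spark/conf /opt/spark/conf

The symlinked spark-submit is what lets the Cloud Dataproc jobs API drive the new install, and pointing the install's conf directory back at /etc/spark/conf keeps the cluster defaults that Dataproc generates.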

James
- would this happen during the custom-image build script or the initialization script? – pavbagel Oct 07 '19 at 21:57
- did it in init script – pavbagel Oct 08 '19 at 23:47
- @pavbagel do you mind sharing your init script somewhere? (pastebin or github gist) – Sam Jul 28 '22 at 00:39
- Don't have it anymore, but found that doing it in the init script was the easiest way. – pavbagel Jul 29 '22 at 18:28
In addition to the steps above, I had to set SPARK_HOME via /etc/profile.d/:
echo export SPARK_HOME=/opt/my/new/spark/ > /etc/profile.d/set_spark_home.sh
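
If you do this from the initialization action as well, a minimal sketch (assuming Spark was installed under /opt/spark as in the example above) would be:

# Assumed install location; must match wherever the custom Spark was installed.
echo 'export SPARK_HOME=/opt/spark' > /etc/profile.d/set_spark_home.sh
# Verify from a fresh login shell that the variable is picked up.
bash -lc 'echo $SPARK_HOME'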

cslattery
- did you do this on the custom-image build script or the initialization script? – pavbagel Oct 07 '19 at 21:58