For one reason or another, I want to install a version of Apache Spark different from the one available on Google Cloud Dataproc. How can I install a custom version of Spark but also maintain compatibility with the Cloud Dataproc tooling?
2 Answers
In general, you should be able to install a custom version of Spark on Dataproc and maintain compatibility with the Cloud Dataproc tooling (mainly Cloud Dataproc jobs).
To do this, you should:
- Install spark in /usr/local/lib/spark or /opt/spark instead of the user home directory
- Don't modify the user .bashrc
- Uninstall the Cloud Dataproc-provided version of spark using apt-get remove
- Symlink /usr/local/bin/spark-submit to the provided binary (this is needed for the Cloud Dataproc jobs API to work with the new Spark install)
- Re-use the /etc/spark/conf provided by Cloud Dataproc
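
For reference, a rough initialization-action sketch of those steps could look like the following. The Spark version, download URL, Debian package names, and the /opt/spark install location are assumptions here, not part of the original answer:

#!/bin/bash
# Hypothetical Dataproc initialization-action sketch of the steps above.
# The Spark version, download URL, package names, and /opt/spark install
# path are placeholders -- adjust them for your cluster.
set -euxo pipefail

SPARK_VERSION="2.4.4"
SPARK_TGZ="spark-${SPARK_VERSION}-bin-hadoop2.7.tgz"

# Uninstall the Dataproc-provided Spark (package names may differ by image version).
apt-get remove -y spark-core spark-python || true

# Install the custom Spark under /opt/spark rather than a user home directory.
wget -q "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_TGZ}" -P /tmp
mkdir -p /opt/spark
tar -xzf "/tmp/${SPARK_TGZ}" -C /opt/spark --strip-components=1

# Symlink spark-submit so the Cloud Dataproc jobs API finds the new install.
ln -sf /opt/spark/bin/spark-submit /usr/local/bin/spark-submit

# Re-use the Dataproc-provided configuration.
rm -rf /opt/spark/conf
ln -s /etc/spark/conf /opt/spark/conf

The symlinked spark-submit is what lets the Cloud Dataproc jobs API drive the new install, and pointing the install's conf directory back at /etc/spark/conf keeps the cluster defaults that Dataproc generates.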

James
- would this happen during the custom-image build script or the initialization script? – pavbagel Oct 07 '19 at 21:57
- did it in init script – pavbagel Oct 08 '19 at 23:47
- @pavbagel do you mind sharing your init script somewhere? (pastebin or github gist) – Sam Jul 28 '22 at 00:39
- Don't have it anymore, but found that doing it in the init script was the easiest way. – pavbagel Jul 29 '22 at 18:28
In addition to the steps above, I had to set SPARK_HOME via /etc/profile.d/:
echo export SPARK_HOME=/opt/my/new/spark/ > /etc/profile.d/set_spark_home.sh
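
If you do this from the initialization action as well, a minimal sketch (assuming Spark was installed under /opt/spark as in the example above) would be:

# Assumed install location; must match wherever the custom Spark was installed.
echo 'export SPARK_HOME=/opt/spark' > /etc/profile.d/set_spark_home.sh
# Verify from a fresh login shell that the variable is picked up.
bash -lc 'echo $SPARK_HOME'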

cslattery
- did you do this on the custom-image build script or the initialization script? – pavbagel Oct 07 '19 at 21:58