
For one reason or another, I want to install a version of Apache Spark different from the one available on Google Cloud Dataproc. How can I install a custom version of Spark but also maintain compatibility with the Cloud Dataproc tooling?

Igor Dvorzhak
James

2 Answers


In general, you should be able to install a custom version of Spark on Dataproc and maintain compatibility with the Cloud Dataproc tooling (mainly the Cloud Dataproc jobs API).

To do this, you should (see the sketch after this list):

  1. Install Spark in /usr/local/lib/spark or /opt/spark instead of the user home directory
  2. Leave the user's .bashrc unmodified
  3. Uninstall the Cloud Dataproc-provided version of Spark using apt-get remove
  4. Symlink /usr/local/bin/spark-submit to the spark-submit binary of the new install (this is needed for the Cloud Dataproc jobs API to work with the new Spark install)
  5. Re-use the /etc/spark/conf provided by Cloud Dataproc
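
Put together as a cluster initialization action, those steps might look roughly like the sketch below. The Spark version, the download URL, and the Debian package name (spark-core) are assumptions; adjust them for your image version and target Spark release.

#!/bin/bash
# Rough sketch of a Dataproc initialization action following the steps above.
set -euxo pipefail

SPARK_VERSION="2.4.8"                              # assumed version
SPARK_PKG="spark-${SPARK_VERSION}-bin-hadoop2.7"   # assumed build
SPARK_HOME="/opt/spark"

# Step 1: install under /opt/spark rather than a user home directory
wget -q "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PKG}.tgz" -P /tmp
tar -xzf "/tmp/${SPARK_PKG}.tgz" -C /opt
ln -sfn "/opt/${SPARK_PKG}" "${SPARK_HOME}"

# Step 3: remove the Dataproc-provided Spark (package name is an assumption)
apt-get -y remove spark-core || true

# Step 4: let the Cloud Dataproc jobs API find the new spark-submit
ln -sf "${SPARK_HOME}/bin/spark-submit" /usr/local/bin/spark-submit

# Step 5: reuse the configuration directory that Cloud Dataproc generated
rm -rf "${SPARK_HOME}/conf"
ln -s /etc/spark/conf "${SPARK_HOME}/conf"

# Step 2: export SPARK_HOME system-wide via /etc/profile.d instead of ~/.bashrc
echo "export SPARK_HOME=${SPARK_HOME}" > /etc/profile.d/custom_spark.sh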
James

In addition to the steps above, I had to set SPARK_HOME via /etc/profile.d/:

echo "export SPARK_HOME=/opt/my/new/spark/" > /etc/profile.d/set_spark_home.sh
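
To pick up the new value without logging out and back in, the file can be sourced in the current shell and the result checked; the path here is just the placeholder from the command above:

source /etc/profile.d/set_spark_home.sh
echo "$SPARK_HOME"        # should print /opt/my/new/spark/
spark-submit --version    # the spark-submit on PATH should report the custom version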

cslattery