
I want to be able to set the following environment variables when submitting a job via dataproc submit:

  1. SPARK_HOME
  2. PYSPARK_PYTHON
  3. SPARK_CONF_DIR
  4. HADOOP_CONF_DIR

How can I achieve that?

figs_and_nuts
  • Can you clarify what your goal is when setting these variables? In general Dataproc will configure the environment for jobs so that e.g. SPARK_HOME is set correctly. Are you trying to override the default locations? – Jerry Ding Jan 06 '22 at 20:18
  • Thank you @JerryDing for your time :) Dataproc does not ship with PySpark 3.2. PySpark 3.2.0 released the pandas API for PySpark, and I have to write our pipelines against it. So I am creating the cluster with an env YAML that installs pyspark as a package, and then overriding the above-mentioned env variables so jobs use this PySpark 3.2.0. Any suggestions or improvements are welcome. – figs_and_nuts Jan 07 '22 at 03:03
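
For reference, cluster creation along the lines described in that comment might look like the sketch below. The environment file, bucket, cluster name, and region are illustrative, and the dataproc:conda.env.config.uri cluster property is an assumption based on Dataproc's conda-related cluster properties, not something stated in the question.

```sh
# environment.yaml (illustrative) -- conda env that installs PySpark 3.2.0:
#   name: pyspark32
#   dependencies:
#     - python=3.8
#     - pip
#     - pip:
#         - pyspark==3.2.0

# Create the cluster, pointing Dataproc at the conda env config stored in GCS
# (the dataproc:conda.env.config.uri property is assumed here).
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties=dataproc:conda.env.config.uri=gs://my-bucket/environment.yaml
```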

1 Answer


Check the doc Setting environment variables on Dataproc cluster nodes for how to set environment variables for the different components in Dataproc.
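
For job-level overrides, one option (a sketch; the cluster name, region, job file, and Python path are placeholders, not values from the question) is to pass the environment variables as Spark properties at submit time: spark.yarn.appMasterEnv.* sets them for the driver and spark.executorEnv.* sets them for the executors.

```sh
# Sketch: cluster name, region, job file, and Python path below are
# hypothetical placeholders.
gcloud dataproc jobs submit pyspark gs://my-bucket/my_job.py \
  --cluster=my-cluster \
  --region=us-central1 \
  --properties=spark.yarn.appMasterEnv.PYSPARK_PYTHON=/opt/conda/envs/pyspark32/bin/python,spark.executorEnv.PYSPARK_PYTHON=/opt/conda/envs/pyspark32/bin/python
```

SPARK_HOME, SPARK_CONF_DIR, and HADOOP_CONF_DIR can be passed the same way (e.g. spark.yarn.appMasterEnv.SPARK_HOME=...), although, as noted in the comments, Dataproc already configures those on the cluster nodes by default.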

Dagang