
I would like an environment variable to be set on each node of my Dataproc cluster so that it is available to a PySpark job that will be running on that cluster. What is the best way to do this?

I'm wondering if there is a way to do it using Compute Engine metadata (although my research so far indicates that Compute Engine metadata is available via the metadata server on Compute Engine instances, not via environment variables).

Beyond that, the only way I can think of doing it is by issuing an export command in a Dataproc initialisation script.

Can anyone suggest any other alternatives?


3 Answers


OS level

Dataproc doesn't have first-class support for OS-level custom environment variables that apply to all processes, but you can achieve this with an init action that appends your env variables to /etc/environment. You might need to restart the affected services from within the init action so they pick up the new values.
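
As a rough sketch (the bucket path and the FOO variable are placeholders, not something Dataproc prescribes), such an init action could look like the script below and be passed with --initialization-actions gs://your-bucket/set-env.sh at cluster creation time:

#!/usr/bin/env bash
# Hypothetical init action, e.g. uploaded to gs://your-bucket/set-env.sh,
# run as root on every node while the cluster is being created.
set -euo pipefail

# Append a custom variable so processes that read /etc/environment see it.
echo "FOO=hello" >> /etc/environment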

Hadoop and Spark services

For Hadoop and Spark services, you can set properties with the hadoop-env or spark-env prefix when creating the cluster, for example:

gcloud dataproc clusters create \
    --properties hadoop-env:FOO=hello,spark-env:BAR=world \
    ...

See the Dataproc cluster properties documentation for more details.

Spark jobs

Spark allows setting env variables at the job level. For executors, you can always use spark.executorEnv.[NAME] to set env variables, but for drivers there is a difference depending on whether you are running the job in cluster mode or client mode.

Client mode (default)

In client mode, driver env variables need to be set in spark-env.sh when creating the cluster. You can use --properties spark-env:[NAME]=[VALUE] as described above.

Executor env variables can be set when submitting the job, for example:

gcloud dataproc jobs submit spark \
    --properties spark.executorEnv.BAR=world \
    ...

or

spark-submit --conf spark.executorEnv.BAR=world ...

Cluster mode

In cluster mode, driver env variables can be set with spark.yarn.appMasterEnv.[NAME], for example:

gcloud dataproc jobs submit spark \
    --properties spark.submit.deployMode=cluster,spark.yarn.appMasterEnv.FOO=hello,spark.executorEnv.BAR=world \
    ...

or

spark-submit \
    --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.FOO=hello \
    --conf spark.executorEnv.BAR=world \
    ...

See the Spark configuration documentation for more details.
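
Since the question is about a PySpark job, here is a minimal sketch of the same cluster-mode properties applied to gcloud dataproc jobs submit pyspark (the script path, cluster name and region are placeholders, not values from the answer):

gcloud dataproc jobs submit pyspark gs://your-bucket/job.py \
    --cluster=your-cluster \
    --region=your-region \
    --properties=spark.submit.deployMode=cluster,spark.yarn.appMasterEnv.FOO=hello,spark.executorEnv.BAR=world

Inside the job, FOO would then be visible to the driver (running in the YARN application master) and BAR to the executors.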

  • I also had to restart each VM. – Job Evers Aug 18 '20 at 14:46
  • I have tried adding `--properties spark-env:spark.executorEnv.MY_CUSTOM=abc` to the gcloud command, but somehow I am not able to read the env var within my main function (using Scala). I am using this code to read the env var: `System.getenv("MY_CUSTOM")`. `getenv` always returns `null`. Am I missing something? – Raman Feb 19 '23 at 17:06

You can use GCE metadata together with a startup-script-url that writes the value to /etc/environment.

gcloud dataproc clusters create NAME \
  --metadata foo=bar,startup-script-url=gs://some-bucket/startup.sh \
  ...

gs://some-bucket/startup.sh

#!/usr/bin/env bash

ENV_VAR=$(curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/foo" -H "Metadata-Flavor: Google")
echo "foo=${ENV_VAR}" >> /etc/environment

Hope it helps...


There is no cluster-level env variable mechanism in Dataproc; however, most components have their own env variable settings, and you can set those through Dataproc properties.

  • Thanks, but that’s not what I’m referring to. I don’t want to set an env var for one of the known components, I want to set a custom env var. – jamiet Apr 14 '20 at 19:01
  • I tried this approach but it does not set the env vars when you call: gcloud dataproc jobs submit spark --properties spark-env:... It seems to work for setting properties though. – markus Mar 08 '21 at 21:13