OS level
Dataproc doesn't have first-class support for OS-level custom environment variables that apply to all processes, but you can achieve it with an initialization action that adds your env variables to /etc/environment. You might need to restart the services in the init action so they pick up the new values.
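As a rough sketch (the script name, bucket, and the service you restart are placeholders, not anything Dataproc prescribes), such an init action could look like this:

#!/usr/bin/env bash
# set-env.sh -- hypothetical init action; upload it to a GCS bucket you control.
set -euxo pipefail

# Append custom variables to /etc/environment so that processes started
# afterwards (login shells, restarted services) can see them.
cat >>/etc/environment <<'EOF'
FOO=hello
BAR=world
EOF

# Services that were already running before this script executed may need a
# restart to pick up the change, e.g.:
# systemctl restart hadoop-yarn-nodemanager

Then attach it when creating the cluster (my-cluster, us-central1 and gs://my-bucket are placeholders):

gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --initialization-actions gs://my-bucket/set-env.sh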
Hadoop and Spark services
For Hadoop and Spark services, you can set properties with the hadoop-env or spark-env prefix when creating the cluster, for example:
gcloud dataproc clusters create \
--properties hadoop-env:FOO=hello,spark-env:BAR=world \
...
See the Dataproc cluster properties documentation for more details.
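To verify that the values landed where expected, you can SSH to the master node and inspect the generated config files; the paths below are the usual Dataproc locations, and my-cluster-m / us-central1-a are placeholder instance and zone names:

# hadoop-env: prefixed properties are written to hadoop-env.sh,
# spark-env: prefixed ones to spark-env.sh.
gcloud compute ssh my-cluster-m --zone=us-central1-a \
  --command='grep FOO /etc/hadoop/conf/hadoop-env.sh; grep BAR /etc/spark/conf/spark-env.sh'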
Spark jobs
Spark allows setting env variables at the job level. For executors, you can always use spark.executorEnv.[NAME] to set env variables, but for drivers there is a difference depending on whether you run the job in client mode or cluster mode.
Client mode (default)
In client mode, driver env variables need to be set in spark-env.sh when creating the cluster. You can use --properties spark-env:[NAME]=[VALUE] as described above.
Executor env variables can be set when submitting the job, for example:
gcloud dataproc jobs submit spark \
--properties spark.executorEnv.BAR=world \
...
or
spark-submit --conf spark.executorEnv.BAR=world ...
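Putting the client-mode pieces together, a minimal end-to-end sketch could look like the following; the cluster name, region, main class, and jar path are all hypothetical:

# Driver env variable: baked into spark-env.sh when the cluster is created.
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties spark-env:FOO=hello

# Executor env variable: supplied per job at submission time.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.MyJob \
  --jars=gs://my-bucket/my-job.jar \
  --properties spark.executorEnv.BAR=world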
Cluster mode
In cluster mode, driver env variables can be set with spark.yarn.appMasterEnv.[NAME], for example:
gcloud dataproc jobs submit spark \
--properties spark.submit.deployMode=cluster,spark.yarn.appMasterEnv.FOO=hello,spark.executorEnv.BAR=world \
...
or
spark-submit \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.FOO=hello \
--conf spark.executorEnv.BAR=world \
...
See the Spark configuration documentation for more details.