
I need to set a custom environment variable in EMR that is available when running a Spark application.

I have tried adding this:

...
--configurations '[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Configurations": [],
        "Properties": { "SOME-ENV-VAR": "qa1" }
      }
    ],
    "Properties": {}
  }
]'
...

and also tried replacing "spark-env" with "hadoop-env", but nothing seems to work.

There is this answer from the AWS forums, but I can't figure out how to apply it. I'm running EMR 5.3.1 and launch it with a preconfigured step from the CLI: aws emr create-cluster...

NetanelRabinowitz

4 Answers


Add the custom configuration, like the JSON below, to a file, say custom_config.json:

[   
  {
   "Classification": "spark-env",
   "Properties": {},
   "Configurations": [
       {
         "Classification": "export",
         "Properties": {
             "VARIABLE_NAME": "VARIABLE_VALUE"
         }
       }
   ]
 }
]

Then, when creating the EMR cluster, pass the file reference to the --configurations option:

aws emr create-cluster --configurations file://custom_config.json --other-options...
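To check whether the export actually reached the application, one option is to read the variable back from the process environment inside the Spark job. A minimal sketch (the name SOME_ENV_VAR is an example; this is plain stdlib code, not an EMR-specific API):

```python
import os

# Read the variable back inside the driver/executor code.
# os.environ.get returns None if the export never reached this process.
value = os.environ.get("SOME_ENV_VAR")
print(value if value is not None else "not set")
```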
franklinsijo
  • This should be the exact same thing as what I did in the question, isn't it ? – NetanelRabinowitz Feb 23 '17 at 09:50
  • Yes, but within a file. – franklinsijo Feb 23 '17 at 09:52
  • Sounds like a bug if it's true. I haven't tried it yet, but are you sure this is what actually made the difference? When I look at the EMR UI under Configuration (with the version from the question), I can see my variable is set; it's just that the Spark app can't see it. – NetanelRabinowitz Feb 23 '17 at 09:59
  • Basically, AWS' EMR does not distribute spark-env.sh to worker nodes. Google's DataProc does. YARN is a sort-of hacky way of getting environmental variables in, though. Consider just passing them as app arguments into your spark job. – yegeniy Aug 09 '18 at 15:00

For me, replacing spark-env with yarn-env fixed the issue.

Przemek
  • the spark-env.sh file seems to get filled in by EMR, but maybe it doesn't get executed or something? Anyways, using yarn-env does seem to work. Maybe it's because we're running YARN, not spark directly? – yegeniy Aug 09 '18 at 01:44
  • Basically, AWS' EMR does not distribute `spark-env.sh` to worker nodes. Google's DataProc does. YARN is a sort-of hacky way of getting environmental variables in, though. Consider just passing them as arguments into your spark job. – yegeniy Aug 09 '18 at 14:58

Use classification yarn-env to pass environment variables to the worker nodes.

Use classification spark-env to pass environment variables to the driver, with deploy mode client. When using deploy mode cluster, use yarn-env.
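If both the driver and the workers need the variable, both classifications can live in the same configuration file. A sketch, assuming the example name MY_VAR (yarn-env takes the same export sub-classification as spark-env):

```json
[
  {
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": { "MY_VAR": "value" }
      }
    ]
  },
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": { "MY_VAR": "value" }
      }
    ]
  }
]
```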

rwitzel

For EMR 6.11.0, running Spark on YARN in cluster mode, I had to use spark-defaults, as per the docs.

Example custom configuration JSON below, setting two environment variables, MY_ENV_VAR and ANOTHER_ENV_VAR. Other spark-defaults properties (e.g. spark.driver.port) can go in the same Properties block; note that JSON allows neither comments nor trailing commas.

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.appMasterEnv.MY_ENV_VAR": "value",
      "spark.yarn.appMasterEnv.ANOTHER_ENV_VAR": "another_value"
    }
  }
]
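The spark.yarn.appMasterEnv.* properties only reach the YARN application master (the driver, in cluster mode). If the executors also need the variable, Spark's documented spark.executorEnv.[Name] property can be set in the same classification; a sketch reusing the example name above:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.appMasterEnv.MY_ENV_VAR": "value",
      "spark.executorEnv.MY_ENV_VAR": "value"
    }
  }
]
```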