
I need to set a custom environment variable in EMR that is available when running a Spark application.

I have tried adding this:

...
--configurations '[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Configurations": [],
        "Properties": { "SOME-ENV-VAR": "qa1" }
      }
    ],
    "Properties": {}
  }
]'
...

and also tried replacing "spark-env" with "hadoop-env", but nothing seems to work.

There is this answer from the AWS forums, but I can't figure out how to apply it. I'm running EMR 5.3.1 and launch it with a preconfigured step from the CLI: aws emr create-cluster...

NetanelRabinowitz

4 Answers


Add the custom configuration, like the JSON below, to a file, say custom_config.json:

[   
  {
   "Classification": "spark-env",
   "Properties": {},
   "Configurations": [
       {
         "Classification": "export",
         "Properties": {
             "VARIABLE_NAME": "VARIABLE_VALUE"
         }
       }
   ]
 }
]

Then, when creating the EMR cluster, pass the file reference to the --configurations option:

aws emr create-cluster --configurations file://custom_config.json --other-options...
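To check whether the export actually reached the application, one option is to read the variable back from the process environment inside the Spark job. A minimal sketch (the name SOME_ENV_VAR is an example; this is plain stdlib code, not an EMR-specific API):

```python
import os

# Read the variable back inside the driver/executor code.
# os.environ.get returns None if the export never reached this process.
value = os.environ.get("SOME_ENV_VAR")
print(value if value is not None else "not set")
```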
franklinsijo
  • This should be the exact same thing as what I did in the question, isn't it ? – NetanelRabinowitz Feb 23 '17 at 09:50
  • Yes, but within a file. – franklinsijo Feb 23 '17 at 09:52
  • Sounds like a bug if it's true. I haven't tried it yet, but are you sure this is what actually made the difference? When I look at the EMR UI under Configuration (with the version from the question), I can see my variable is set; it's just that the Spark app can't see it. – NetanelRabinowitz Feb 23 '17 at 09:59
  • Basically, AWS' EMR does not distribute spark-env.sh to worker nodes. Google's DataProc does. YARN is a sort-of hacky way of getting environmental variables in, though. Consider just passing them as app arguments into your spark job. – yegeniy Aug 09 '18 at 15:00

For me, replacing spark-env with yarn-env fixed the issue.

Przemek
  • the spark-env.sh file seems to get filled in by EMR, but maybe it doesn't get executed or something? Anyways, using yarn-env does seem to work. Maybe it's because we're running YARN, not spark directly? – yegeniy Aug 09 '18 at 01:44
  • Basically, AWS' EMR does not distribute `spark-env.sh` to worker nodes. Google's DataProc does. YARN is a sort-of hacky way of getting environmental variables in, though. Consider just passing them as arguments into your spark job. – yegeniy Aug 09 '18 at 14:58

Use classification yarn-env to pass environment variables to the worker nodes.

Use classification spark-env to pass environment variables to the driver, with deploy mode client. When using deploy mode cluster, use yarn-env.
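If both the driver and the workers need the variable, both classifications can live in the same configuration file. A sketch, assuming the example name MY_VAR (yarn-env takes the same export sub-classification as spark-env):

```json
[
  {
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": { "MY_VAR": "value" }
      }
    ]
  },
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": { "MY_VAR": "value" }
      }
    ]
  }
]
```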

rwitzel

For EMR 6.11.0, running Spark on YARN in cluster mode, I had to use spark-defaults, as per the docs.

Example custom configuration JSON below, setting two environment variables, MY_ENV_VAR and ANOTHER_ENV_VAR. Other spark-defaults properties (e.g. spark.driver.port) can go in the same Properties block; note that JSON allows neither comments nor trailing commas.

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.appMasterEnv.MY_ENV_VAR": "value",
      "spark.yarn.appMasterEnv.ANOTHER_ENV_VAR": "another_value"
    }
  }
]
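The spark.yarn.appMasterEnv.* properties only reach the YARN application master (the driver, in cluster mode). If the executors also need the variable, Spark's documented spark.executorEnv.[Name] property can be set in the same classification; a sketch reusing the example name above:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.appMasterEnv.MY_ENV_VAR": "value",
      "spark.executorEnv.MY_ENV_VAR": "value"
    }
  }
]
```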