
I added the following configuration under the spark-env classification:

--configurations '[
     {
       "Classification": "spark-env",
       "Properties": {},
       "Configurations": [
           {
             "Classification": "export",
             "Properties": {
                 "MY_VARIABLE": "MY_VARIABLE"
             }
           }
       ]
     }
     ]'

But if I just run echo $MY_VARIABLE in bash, I can't see it in the terminal.

Basically what I want to do is the following:

  • schedule the creation of an AWS EMR cluster with AWS Lambda (where I would set all my environment variables such as git credentials)
  • in the bootstrapping of the machine, install a bunch of things, including git
  • git clone a repository (so I need to use the credentials stored in the environment variables)
  • execute some code from this repository
Pierre

2 Answers


Pass the environment variables as arguments to the bootstrap action.
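A minimal sketch of that approach, assuming the cluster is created with the AWS CLI; the bucket, script name, repository URL, and argument names below are placeholders:

aws emr create-cluster \
    --name "my-cluster" \
    --release-label emr-6.10.0 \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Path=s3://my-bucket/bootstrap.sh,Args=["$GIT_USER","$GIT_TOKEN"]

The bootstrap script then receives the values as positional arguments (bootstrap actions run on every node before the applications start):

#!/bin/bash
# bootstrap.sh - reads the credentials passed as bootstrap-action arguments
GIT_USER="$1"
GIT_TOKEN="$2"
sudo yum install -y git
git clone "https://${GIT_USER}:${GIT_TOKEN}@github.com/my-org/my-repo.git" /home/hadoop/my-repo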

rwitzel

The reason you can't see MY_VARIABLE with echo is that it is only exported into the Spark environment (spark-env), not into your interactive shell.

Assuming you are using PySpark, if you open a pyspark shell (while ssh'd into one of the nodes of your cluster) and run os.getenv("MY_VARIABLE"), you'll see the value you assigned to that variable.
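For example, a quick check on the master node (the value shown assumes the configuration from the question, where the variable is set to the literal string "MY_VARIABLE"):

$ pyspark
>>> import os
>>> os.getenv("MY_VARIABLE")
'MY_VARIABLE'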

An alternative solution for your use case: instead of using credentials (which in general is not the preferred way), you could use a set of keys that lets you clone the repo over SSH rather than HTTPS. You can store those keys in AWS SSM Parameter Store and retrieve them in the EMR bootstrap script. An example could be:

bootstrap.sh

# Fetch the key material from SSM Parameter Store (decrypting a SecureString)
export SSM_VALUE=$(aws ssm get-parameter --name "$REDSHIFT_DWH_PUBLIC_KEY" --with-decryption --query 'Parameter.Value' --output text)
# Append it to the authorized_keys file so the matching private key is accepted
echo "$SSM_VALUE" >> "$AUTHORIZED_KEYS"

In my case, I needed to connect to a Redshift instance, but this would also work nicely for your use case.
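For completeness, a sketch of how the key could be stored in SSM beforehand (the parameter name matches the placeholder used in bootstrap.sh; the key file path is just an example):

aws ssm put-parameter \
    --name "$REDSHIFT_DWH_PUBLIC_KEY" \
    --type SecureString \
    --value "$(cat ~/.ssh/id_rsa.pub)"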

Alessio
