
I added the following configuration under the spark-env classification:

--configurations '[
     {
       "Classification": "spark-env",
       "Properties": {},
       "Configurations": [
           {
             "Classification": "export",
             "Properties": {
                 "MY_VARIABLE": "MY_VARIABLE"
             }
           }
       ]
     }
     ]'

But if I just run echo $MY_VARIABLE in bash, I can't see it in the terminal.

Basically what I want to do is the following:

  • schedule the creation of an AWS EMR cluster with AWS Lambda (where I would set all my environment variables such as git credentials)
  • in the bootstrapping of the machine, install a bunch of things, including git
  • git clone a repository (so I need to use the credentials stored in the environment variables)
  • execute some code from this repository
Pierre

2 Answers


Pass the environment variables as arguments to the bootstrap action.
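A minimal sketch of that approach, assuming the cluster is created with the AWS CLI; the bucket, script name, repository URL, and argument names below are placeholders:

aws emr create-cluster \
    --name "my-cluster" \
    --release-label emr-6.10.0 \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --bootstrap-actions Path=s3://my-bucket/bootstrap.sh,Args=["$GIT_USER","$GIT_TOKEN"]

The bootstrap script then receives the values as positional arguments (bootstrap actions run on every node before the applications start):

#!/bin/bash
# bootstrap.sh - reads the credentials passed as bootstrap-action arguments
GIT_USER="$1"
GIT_TOKEN="$2"
sudo yum install -y git
git clone "https://${GIT_USER}:${GIT_TOKEN}@github.com/my-org/my-repo.git" /home/hadoop/my-repo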

rwitzel

The reason you can't see MY_VARIABLE with echo is that it is only exported into the Spark environment (spark-env), not into your interactive shell.

Assuming you are using PySpark, if you open a pyspark shell (while ssh'd into one of the nodes of your cluster) and run os.getenv("MY_VARIABLE"), you'll see the value you assigned to that variable.
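For example, a quick check on the master node (the value shown assumes the configuration from the question, where the variable is set to the literal string "MY_VARIABLE"):

$ pyspark
>>> import os
>>> os.getenv("MY_VARIABLE")
'MY_VARIABLE'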

An alternative solution for your use case: instead of using credentials (which in general is not the preferred way), you could use a set of keys that lets you clone the repo over SSH rather than HTTPS. You can store those keys in AWS SSM Parameter Store and retrieve them in the EMR bootstrap script. An example could be:

bootstrap.sh

# Fetch the key material from SSM Parameter Store (decrypting a SecureString)
export SSM_VALUE=$(aws ssm get-parameter --name "$REDSHIFT_DWH_PUBLIC_KEY" --with-decryption --query 'Parameter.Value' --output text)
# Append it to the authorized_keys file so the matching private key is accepted
echo "$SSM_VALUE" >> "$AUTHORIZED_KEYS"

In my case, I needed to connect to a Redshift instance, but this would also work nicely for your use case.
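For completeness, a sketch of how the key could be stored in SSM beforehand (the parameter name matches the placeholder used in bootstrap.sh; the key file path is just an example):

aws ssm put-parameter \
    --name "$REDSHIFT_DWH_PUBLIC_KEY" \
    --type SecureString \
    --value "$(cat ~/.ssh/id_rsa.pub)"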

Alessio
