
I'm having a surprisingly hard time working with additional libraries via my EMR notebook. The AWS interface for EMR allows me to create Jupyter notebooks and attach them to a running cluster, and I'd like to use additional libraries in them. SSHing into the machines and installing manually as ec2-user or root does not make the libraries available to the notebook, as it apparently runs as the livy user. Bootstrap actions install things for the hadoop user. I can't install from within the notebook because its user apparently doesn't have sudo, git, etc., and it probably wouldn't install to the slaves anyway.

What is the canonical way of installing additional libraries for notebooks created through the EMR interface?

Walrus the Cat

4 Answers


For the sake of an example, let's assume you need the librosa Python module on a running EMR cluster. We're going to use Python 2.7, as the procedure is simpler: Python 2.7 is guaranteed to be on the cluster, and it's the default runtime for EMR.

Create a script that installs the package:

#!/bin/bash
sudo easy_install-2.7 pip
sudo /usr/local/bin/pip2 install librosa

and save it to your home directory, e.g. /home/hadoop/install_librosa.sh. Note the name; we're going to use it later.

In the next step you're going to run this script through another script inspired by the Amazon EMR docs: emr_install.py. It uses AWS Systems Manager (SSM) to execute your script across the nodes.

import sys
import time

from boto3 import client

try:
    cluster_id = sys.argv[1]
except IndexError:
    print("Syntax: emr_install.py [ClusterId]")
    sys.exit(1)

emrclient = client('emr')

# Get the list of core nodes
instances = emrclient.list_instances(ClusterId=cluster_id,
                                     InstanceGroupTypes=['CORE'])['Instances']
instance_list = [x['Ec2InstanceId'] for x in instances]

# Tag the core nodes so SSM can target them
ec2client = client('ec2')
ec2client.create_tags(Resources=instance_list,
                      Tags=[{"Key": "environment", "Value": "coreNodeLibs"}])

ssmclient = client('ssm')

# Run the shell script on every tagged node
command = ssmclient.send_command(
    Targets=[{"Key": "tag:environment", "Values": ["coreNodeLibs"]}],
    DocumentName='AWS-RunShellScript',
    Parameters={"commands": ["bash /home/hadoop/install_librosa.sh"]},
    TimeoutSeconds=3600)['Command']['CommandId']

# Give the installation some time to run, then check how it went
time.sleep(30)

command_status = ssmclient.list_commands(
    CommandId=command)['Commands'][0]['Status']

print("Command " + command + ": " + command_status)

To run it:

python emr_install.py [cluster_id]
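The fixed 30-second sleep above is only a guess at how long the install takes. A more robust sketch (assuming boto3 credentials are configured; `wait_for_command` and `TERMINAL_STATES` are illustrative names, not part of the AWS API) polls SSM's GetCommandInvocation for each node until every invocation reaches a terminal state:

```python
import time

# Terminal states reported by SSM's GetCommandInvocation API.
TERMINAL_STATES = {"Success", "Failed", "Cancelled", "TimedOut"}

def is_terminal(status):
    """Return True once a command invocation can no longer change state."""
    return status in TERMINAL_STATES

def wait_for_command(ssmclient, command_id, instance_ids, delay=10):
    """Poll each instance until its invocation finishes; return final statuses."""
    statuses = {}
    for instance_id in instance_ids:
        while True:
            inv = ssmclient.get_command_invocation(
                CommandId=command_id, InstanceId=instance_id)
            if is_terminal(inv["Status"]):
                statuses[instance_id] = inv["Status"]
                break
            time.sleep(delay)
    return statuses
```

You would call `wait_for_command(ssmclient, command, instance_list)` in place of the sleep-and-check at the end of the script.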

Lukasz Tracewski

What is the canonical way of installing additional libraries for notebooks created through the EMR interface?

EMR Notebooks recently launched 'notebook-scoped libraries', which let you install additional Python libraries on your cluster from a public or private PyPI repository and use them within the notebook session.

Notebook-scoped libraries provide the following benefits:

  • You can use libraries in an EMR notebook without having to re-create the cluster or re-attach the notebook to a cluster.
  • You can isolate library dependencies of an EMR notebook to the individual notebook session. The libraries installed from within the notebook cannot interfere with other libraries on the cluster or libraries installed within other notebook sessions.

More details here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks-scoped-libraries.html

Technical blog: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
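Concretely, inside a notebook cell attached to the cluster (PySpark kernel), a session-scoped install looks like this; `install_pypi_package` and `list_packages` are methods EMR Notebooks exposes on the provided SparkContext `sc`, per the linked docs:

```python
# In an EMR notebook cell (PySpark kernel); `sc` is the SparkContext
# provided by EMR Notebooks. Requires emr-5.26.0 or later.
sc.install_pypi_package("librosa")   # install from public PyPI for this session only
sc.list_packages()                   # confirm the library is visible to the session
```

The library is installed only for the current notebook session and disappears when the session ends, which is what keeps it from interfering with other notebooks on the same cluster.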

Parag Chaudhari

What I usually do in this case is delete my cluster and create a new one with bootstrap actions. Bootstrap actions allow you to install additional libraries on your cluster: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html. For example, writing the following script and saving it in S3 will allow you to use datadog from a notebook running on top of your cluster (at least it works with EMR 5.19):

#!/bin/bash -xe
#install datadog module for using in pyspark
sudo pip-3.4 install -U datadog

Here is the command line I would run to launch this cluster:

aws emr create-cluster --release-label emr-5.19.0 \
--name 'EMR 5.19 test' \
--applications Name=Hadoop Name=Spark Name=Hive Name=Livy \
--use-default-roles \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
--region eu-west-1 \
--log-uri s3://<path-to-logs> \
--configurations file://config-emr.json \
--bootstrap-actions Path=s3://<path-to-bootstrap-in-aws>,Name=InstallPythonModules

And here is config-emr.json, stored locally on your computer:

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]

I assume you could do exactly the same thing when creating a cluster through the EMR interface, by going to the advanced options during creation.
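To confirm the bootstrap action took effect, you can check from any notebook cell whether the module is importable in the Python 3 runtime. This is a generic sketch; `module_available` is an illustrative helper, and you would substitute whatever module your bootstrap script installed:

```python
import importlib.util

def module_available(name):
    """Return True if `name` can be imported in the current runtime."""
    return importlib.util.find_spec(name) is not None

print(module_available("datadog"))  # True on a cluster where the bootstrap ran
```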

Sacha

I spent way too long on this; AWS documentation and support did not help at all. But I did get it working, so that you can install Python libraries directly in the notebook.

If you can do the items below, then you can install libraries by running pip install commands in a single-line Jupyter cell, with the Python runtime, like so:

!pip install pandas

One thing that confused me a lot was that I could SSH into the cluster and reach out to the internet (ping and pip both worked), but the notebook was not able to reach out, nor were any libraries actually available. You need to make sure that the notebook itself can reach out. One good test is simply whether you can ping out, with the same structure as above: a single line starting with !

!ping google.com

If that is taking too long and timing out then you still need to figure out your VPN/subnet rules.
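As an alternative to `!ping` (ICMP is sometimes blocked even when HTTPS egress works), a TCP-level check from a notebook cell tells you whether outbound traffic is possible. This is a sketch; `can_reach` is an illustrative helper, not an AWS API:

```python
import socket

def can_reach(host, port=443, timeout=3):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_reach("pypi.org"))  # False means your subnet/NAT rules block egress
```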

Notes below on cluster creation:

  • (step 1) This does not work for every version of EMR. I have it working on 5.30.0, but last I checked 5.30.1 did not work.
  • (step 2 -> Networking) You need to make sure you're on a private subnet and your VPN can reach out to the public internet. Again, don't let SSHing into the server fool you: the notebook runs either inside a Docker image there or somewhere else entirely. The only relevant tests are the ones you run directly from the notebook.

Once you have this working and install a package it will work for any notebook on that cluster. I have a notebook just called install that has one line per package that I run through whenever I spin up a new cluster.

Soshmo