For the sake of an example, let's assume you need librosa
Python module on running EMR cluster. We're going to use Python 2.7 as the procedure is simpler - Python 2.7 is guaranteed to be on the cluster and that's the default runtime for EMR.
Create a script that installs the package:
#!/bin/bash
sudo easy_install-2.7 pip
sudo /usr/local/bin/pip2 install librosa
and save it to your home directory, e.g. /home/hadoop/install_librosa.sh
. Note the name, we're going to use it later.
In the next step you're going to run this script through another script inspired by Amazon EMR docs: emr_install.py
. It uses AWS Systems Manager to execute your script over the nodes.
import time
from boto3 import client
from sys import argv
try:
clusterId=argv[1]
except:
print("Syntax: emr_install.py [ClusterId]")
import sys
sys.exit(1)
emrclient=client('emr')
# Get list of core nodes
instances=emrclient.list_instances(ClusterId=clusterId,InstanceGroupTypes=['CORE'])['Instances']
instance_list=[x['Ec2InstanceId'] for x in instances]
# Attach tag to core nodes
ec2client=client('ec2')
ec2client.create_tags(Resources=instance_list,Tags=[{"Key":"environment","Value":"coreNodeLibs"}])
ssmclient=client('ssm')
# Run shell script to install libraries
command=ssmclient.send_command(Targets=[{"Key": "tag:environment", "Values":["coreNodeLibs"]}],
DocumentName='AWS-RunShellScript',
Parameters={"commands":["bash /home/hadoop/install_librosa.sh"]},
TimeoutSeconds=3600)['Command']['CommandId']
command_status=ssmclient.list_commands(
CommandId=command,
Filters=[
{
'key': 'Status',
'value': 'SUCCESS'
},
]
)['Commands'][0]['Status']
time.sleep(30)
print("Command:" + command + ": " + command_status)
To run it:
python emr_install.py [cluster_id]