
We are planning to migrate our Hadoop infrastructure from our data center to AWS EMR. Some of the tasks/stages in our ETL process depend on each other; for example, the flow is:

  1. Map Reduce job will generate data
  2. Shell script will move the data generated in step 1 to the output location

In EMR, we could find steps for Custom JAR, Pig, and Hive, but we did not find an option to execute a shell script. A few options we have to work around this are:

  • We can write the shell script logic in a Java program and add it as a custom JAR step.
  • A bootstrap action. But since our requirement is to execute the shell script after step 1 is complete, I am not sure whether it will be useful.

Rather than reinventing the wheel, if there is another option directly available from EMR or AWS that fulfils our requirement, our effort would be reduced.

Free Coder

2 Answers


Please refer to the link: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html

aws emr create-cluster --name "Test cluster" --release-label emr-4.4.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
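If the cluster is already running, the same script-runner step can also be attached afterwards with `aws emr add-steps` instead of at cluster creation time. A minimal sketch, where the cluster ID and S3 script path are placeholders, not values from the question:

```shell
# Attach a script-runner step to a running cluster.
# j-XXXXXXXXXXXXX and the S3 paths below are hypothetical placeholders.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=RunShellScript,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
```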
Kiran Thati

For running the shell script via steps, we can still use `command-runner.jar` and pass the absolute path to the script as follows:

**JAR location** : command-runner.jar
**Arguments** : bash /home/hadoop/script_name.sh or bash /path_to_script/script_name.sh

{
    'Name': 'run_script',
    'ActionOnFailure': 'CANCEL_AND_WAIT',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': [
            'bash', '/home/hadoop/script_name.sh'
        ]
    }
}
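A step dictionary in that shape plugs straight into boto3's `add_job_flow_steps` call. A minimal sketch, assuming boto3 is available; the cluster ID here is a hypothetical placeholder:

```python
# Step definition matching the answer above: run a shell script
# on the master node via command-runner.jar.
step = {
    'Name': 'run_script',
    'ActionOnFailure': 'CANCEL_AND_WAIT',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': ['bash', '/home/hadoop/script_name.sh'],
    },
}

def submit_step(emr_client, cluster_id):
    # emr_client would be boto3.client('emr'); cluster_id is a
    # placeholder such as 'j-XXXXXXXXXXXXX', not a real cluster.
    return emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```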