
We are planning to migrate our Hadoop infrastructure from our data center to AWS EMR. Some of the tasks/stages in our ETL process depend on each other; for example, the flow is:

  1. Map Reduce job will generate data
  2. Shell script will move the data generated in step 1 to the output location

In EMR, we could find steps for Custom JAR, Pig, and Hive, but we did not find an option to execute a shell script. A few options we have to work around this are:

  • We can write the shell script logic in a Java program and add it as a custom JAR step.
  • A bootstrap action. But since our requirement is to execute the shell script after step 1 is complete, I am not sure whether it will be useful.

Rather than reinventing the wheel, if there is another option directly available from EMR or AWS that fulfils our requirement, our effort would be reduced.

Free Coder

2 Answers


Please refer to the link: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html

aws emr create-cluster --name "Test cluster" --release-label emr-4.4.0 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
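If the cluster is already running, the same script-runner step can also be attached afterwards with `aws emr add-steps` instead of at cluster creation time. A minimal sketch, where the cluster ID and S3 script path are placeholders, not values from the question:

```shell
# Attach a script-runner step to a running cluster.
# j-XXXXXXXXXXXXX and the S3 paths below are hypothetical placeholders.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=RunShellScript,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
```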
Kiran Thati

For running the shell script via steps, we can still use `command-runner.jar` and pass the absolute path to the script as follows:

**JAR location** : command-runner.jar
**Arguments** : bash /home/hadoop/script_name.sh or bash /path_to_script/script_name.sh

{
    'Name': 'run_script',
    'ActionOnFailure': 'CANCEL_AND_WAIT',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': [
            'bash', '/home/hadoop/script_name.sh'
        ]
    }
}
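A step dictionary in that shape plugs straight into boto3's `add_job_flow_steps` call. A minimal sketch, assuming boto3 is available; the cluster ID here is a hypothetical placeholder:

```python
# Step definition matching the answer above: run a shell script
# on the master node via command-runner.jar.
step = {
    'Name': 'run_script',
    'ActionOnFailure': 'CANCEL_AND_WAIT',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': ['bash', '/home/hadoop/script_name.sh'],
    },
}

def submit_step(emr_client, cluster_id):
    # emr_client would be boto3.client('emr'); cluster_id is a
    # placeholder such as 'j-XXXXXXXXXXXXX', not a real cluster.
    return emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```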