
I am trying to run a JAR file for a Spark job in Data Pipeline, but I am not sure what exactly I need to pass in the EMR step?

Monika Patel

1 Answer

The EMR step is where you describe how you want to submit the Spark JAR.

When you create a new Data Pipeline, you can choose the "Build using a template" option and then pick "Run job on an Elastic MapReduce cluster".

Now, in the EmrActivity, you describe the step you want to submit (you can also run multiple steps if you want).

You can read the AWS EMR Spark Step Guide to understand what a step is. In short, it is where you describe how to submit the Spark job.
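To make "what a step is" concrete, here is a sketch of the same spark-submit invocation expressed as a plain EMR step (this is what the EmrActivity wraps). The step name, cluster id, and S3 paths are placeholders of mine, not from the original answer; the boto3 call is commented out since it needs a live cluster and credentials:

```python
# A plain EMR step: command-runner.jar receives the spark-submit command
# and its arguments. Data Pipeline's EmrActivity describes the same thing,
# just encoded as a single comma-separated string.
spark_step = {
    "Name": "my-spark-job",  # hypothetical step name
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "--class", "com.exelate.main.App",
            "--master", "yarn-cluster",
            "s3://my-bucket/my-job.jar",  # hypothetical JAR location on S3
        ],
    },
}

# To submit it directly to a running cluster (outside Data Pipeline):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX", Steps=[spark_step])
```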

Pay attention, though: on Data Pipeline, for some obscure reason, you need to replace the spaces in the step with commas (','). Here is an example of a Spark step I ran on Data Pipeline:

command-runner.jar,spark-submit,--deploy-mode,cluster,--class,com.exelate.main.App,--master,yarn-cluster,--name,<spark job name>,--num-executors,1000,--driver-cores,2,--driver-memory,10g,--executor-memory,16g,--executor-cores,4,<jar location on s3>,<jar arguments>

I left some of my configuration in so you can see where each setting goes, and I replaced some values with <"text"> placeholders so you can substitute your own information.
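The space-to-comma convention above is mechanical, so it can be sketched as a tiny helper (the function name and the shortened argument list are mine, for illustration) that turns ordinary spark-submit tokens into the string Data Pipeline expects:

```python
# Data Pipeline wants the EMR step as one comma-separated string rather
# than space-separated command-line tokens.

def to_pipeline_step(args):
    """Join spark-submit tokens with commas, as the EmrActivity step requires."""
    return ",".join(args)

step = to_pipeline_step([
    "command-runner.jar", "spark-submit",
    "--deploy-mode", "cluster",
    "--class", "com.exelate.main.App",
    "--master", "yarn-cluster",
    "--num-executors", "1000",
    "s3://my-bucket/my-job.jar",  # hypothetical JAR location on S3
])
print(step)
# command-runner.jar,spark-submit,--deploy-mode,cluster,--class,...
```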

Tal Joffe
  • Thank you! It helped us write the step for our data pipeline. We actually had a couple of issues; our EMR Spark instance was not compatible with our job. It took a while, but we were able to figure it out. – Monika Patel Aug 15 '17 at 19:22
  • Do dependencies work as usual with Spark jobs? When I create a step with the CLI, it doesn't wait for the job to finish and returns immediately. Does Data Pipeline monitor the job until it finishes? – lfk Apr 24 '18 at 04:11
  • Yes, it is a pipeline, so you can pipe events. If you use an EMR resource for the EMR activity, the pipeline will also terminate it when it finishes. But testing this is really simple: just try and see. – Tal Joffe Apr 24 '18 at 07:31