
I use AWS EMR for our Spark Streaming jobs. I add a step in EMR that reads data from a Kinesis stream. What I need is an approach to stop this step and add a new one.

Right now I spawn a thread from the Spark driver that listens to an SQS queue; upon receiving a message, I call sparkContext.stop(). I use Chef for our deployment automation, so when there is a new artifact, a message is put into SQS, the driver reads it and stops the step. Chef then adds a new step via the EMR API.
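The shutdown-watcher pattern described above can be sketched as follows. This is a minimal, self-contained illustration: a `queue.Queue` stands in for the SQS queue (which would really be polled with the AWS SDK) and a plain callback stands in for `sparkContext.stop()`; both substitutions are assumptions made so the sketch runs anywhere.

```python
import queue
import threading

def watch_for_shutdown(message_source, stop_callback, poll_timeout=5.0):
    """Poll a message source; invoke stop_callback on a shutdown message.

    In the real deployment, message_source would be the SQS queue and
    stop_callback would be sparkContext.stop(); both are stand-ins here.
    """
    while True:
        try:
            msg = message_source.get(timeout=poll_timeout)
        except queue.Empty:
            continue  # no message yet; keep polling
        if msg == "shutdown":
            stop_callback()
            return

# Stand-ins for the SQS queue and for sparkContext.stop()
sqs_stand_in = queue.Queue()
stopped = threading.Event()

watcher = threading.Thread(
    target=watch_for_shutdown,
    args=(sqs_stand_in, stopped.set),
    daemon=True,
)
watcher.start()

sqs_stand_in.put("shutdown")   # Chef would publish this after a new artifact
watcher.join(timeout=10)
print(stopped.is_set())        # the "driver" has been asked to stop
```

Running the watcher as a daemon thread mirrors the question's setup: the streaming job keeps the driver alive, and the side thread only intervenes when a deployment message arrives.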

My question is: is this the right way to stop a long-running streaming job in EMR? How would this be handled had Spark been deployed on a standalone cluster instead of EMR?

zero323
Aravindh S

1 Answer


The EMR Step API does not currently support stopping a step. When you submit a step, EMR usually runs a hadoop jar command with the arguments you provided; if the step type is Spark, it runs a spark-submit command. When that command returns exit code 0, the step is marked FINISHED; any other exit code marks it FAILED. The state can also depend on the YARN applications currently running: you can observe that EMR will not mark a step FINISHED while there are ongoing YARN applications (not necessarily ones spawned by the step) running at that time.

So, by writing custom code in the main class / JAR you pass to spark-submit, you can drive the step to the desired state by exiting with the corresponding exit code.
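The exit-code convention above can be demonstrated with a small, runnable sketch. Here a Python subprocess stands in for the driver process launched by spark-submit (an assumption for portability), and `step_state_from_exit` encodes the FINISHED/FAILED mapping the answer describes.

```python
import subprocess
import sys

def step_state_from_exit(code):
    # EMR marks a step FINISHED on exit code 0 and FAILED on any other code
    return "FINISHED" if code == 0 else "FAILED"

# Simulate a driver main() that exits cleanly vs. one that exits with an error.
ok = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])
bad = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])

print(step_state_from_exit(ok.returncode))   # FINISHED
print(step_state_from_exit(bad.returncode))  # FAILED
```

In the real job, the clean path would be the driver calling sparkContext.stop() and returning normally from main(), so that spark-submit itself exits 0 and the step shows as FINISHED rather than FAILED.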

You can find the exact command EMR translates your step into by looking at the step's controller.log.

jc mannem