
I am trying to figure out how I can terminate an EMR cluster successfully once all the steps submitted to it are 'COMPLETED'|'CANCELLED'|'FAILED'|'INTERRUPTED'. There are three Lambda functions.

  • Lambda 1: Does some work and creates the EMR cluster. Triggers Lambda 2 by passing the steps and cluster ID through the event.
  • Lambda 2: Submits the steps received from Lambda 1 to the cluster ID received from the same.
  • Lambda 3: Submits a final step and then should send a request for termination when all steps are 'COMPLETED'|'CANCELLED'|'FAILED'|'INTERRUPTED'.

I've got as far as Lambda 3's step submission, but I'm unable to do the rest.

I have successfully created the EMR cluster through:

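import boto3

conn = boto3.client("emr")
# run_job_flow() returns a dict; the cluster ID is under "JobFlowId" (arguments omitted here)
cluster_id = conn.run_job_flow()["JobFlowId"]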

and submitted the steps through:

conn = boto3.client("emr")
action = conn.add_job_flow_steps(JobFlowId=cluster_id, Steps=event["steps"])

Now, how can termination be triggered only under the given condition? The boto3 API docs list client.terminate_job_flows(), but this function doesn't wait for the steps to finish or fail; it starts the termination process immediately.

Is there a way to change KeepJobFlowAliveWhenNoSteps from TRUE to FALSE once all my steps are done? I think the cluster would then shut down automatically. But going by the API docs, I didn't find any option to change this parameter after run_job_flow() has been called.

Hope I was able to convey the issue I faced correctly. Any help?

Note: I'm using Python 3.8 in AWS Lambda. Each step is a Spark job.

John Rotenstein
Aakash Basu

1 Answer


I agree with your research. The optimal situation would be to set KeepJobFlowAliveWhenNoSteps to FALSE to have the cluster self-terminate.

I do notice that the RunJobFlow documentation says:

If the KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the cluster transitions to the WAITING state rather than shutting down after the steps have completed.

Therefore, the Lambda function could check whether the cluster is in the WAITING state and, if so, shut down the cluster. However, this would require repeated checking.
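The repeated check could look something like the sketch below. It looks at the step states from the question rather than the cluster's WAITING state, and it takes the boto3 EMR client as an argument. The names `all_steps_finished`, `terminate_when_done`, `poll_seconds` and `max_polls` are made up for illustration, and the defaults are arbitrary.

```python
import time

# Step states the question treats as "done".
TERMINAL_STEP_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def all_steps_finished(emr, cluster_id):
    """Return True when every step on the cluster is in a terminal state."""
    paginator = emr.get_paginator("list_steps")
    for page in paginator.paginate(ClusterId=cluster_id):
        for step in page["Steps"]:
            if step["Status"]["State"] not in TERMINAL_STEP_STATES:
                return False
    return True


def terminate_when_done(emr, cluster_id, poll_seconds=30, max_polls=20):
    """Poll until all steps finish, then request termination.

    Returns True if termination was requested, False if polling gave up.
    """
    for _ in range(max_polls):
        if all_steps_finished(emr, cluster_id):
            emr.terminate_job_flows(JobFlowIds=[cluster_id])
            return True
        time.sleep(poll_seconds)
    return False
```

In Lambda 3 this might be called as `terminate_when_done(boto3.client("emr"), cluster_id)`. Since a Lambda invocation is capped at 15 minutes, long Spark jobs would probably need the check to run on a schedule (one call to `all_steps_finished` per invocation) instead of sleeping inside a single invocation.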

It might be possible to submit a final step that calls the EMR API to shut down the cluster. This means that the cluster is effectively calling for its own termination as a final step. (I haven't tried this concept, but it would be a clean way of performing the shutdown without having to repeatedly check the status.)
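An untested sketch of such a step definition follows. It assumes the cluster runs steps one at a time (so this step only starts after the earlier ones), that `command-runner.jar` and the AWS CLI are available on the cluster, that the CLI has a region configured, and that the cluster's EC2 instance profile is allowed to terminate the cluster. The function name `build_self_terminate_step` and the step name are made up.

```python
def build_self_terminate_step(cluster_id):
    """Build a step that asks EMR to terminate the cluster it runs on.

    Assumption: the cluster's instance profile permits the terminate call.
    """
    return {
        "Name": "Terminate cluster",
        # CONTINUE: if the terminate call itself fails, leave the cluster as-is.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "aws", "emr", "terminate-clusters",
                "--cluster-ids", cluster_id,
            ],
        },
    }
```

Lambda 3 could then submit it last, for example `conn.add_job_flow_steps(JobFlowId=cluster_id, Steps=[build_self_terminate_step(cluster_id)])`.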

There is also a similar discussion about shutting down idle clusters on this Question: How to terminate AWS EMR Cluster automatically after some time

John Rotenstein
  • Hi @John, you've been my AWS saviour for more than a year now, thanks for that. I was thinking through the similar lines of your second last paragraph which will submit a final step to self terminate the EMR, but no luck finding a set of step arguments/parameters for the same. Also checked your other answer but the isIdle Boolean seems to be deprecated. Please let me know if there's a way to crack this. – Aakash Basu May 29 '20 at 14:31
  • [Monitor Metrics with CloudWatch - Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html) makes several references to the `isIdle` metric. I couldn't see any indication that it is deprecated. – John Rotenstein May 31 '20 at 05:57
  • Here's another sneaky idea... I notice that when calling `add_job_flow_steps()`, there is a parameter `'ActionOnFailure': 'TERMINATE_JOB_FLOW'|'TERMINATE_CLUSTER'`. I wonder if you could add an intentionally-bad step, and then tell it to `TERMINATE_CLUSTER` on failure? – John Rotenstein May 31 '20 at 05:58
  • Wow, that is an awesome idea. Anyway, I've built a solution already which is running as expected. Shall post an answer with the code, here. – Aakash Basu May 31 '20 at 10:33
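
For reference, the intentionally-bad step idea from the comments might be sketched as below. This is untested; it assumes `command-runner.jar` can run a shell command on the cluster, and the function name `build_failing_terminate_step` and the step name are made up.

```python
def build_failing_terminate_step():
    """Build a step that always fails so EMR terminates the cluster."""
    return {
        "Name": "Force termination",
        # On failure, EMR is asked to terminate the cluster.
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # A command that exits non-zero, so the step is marked FAILED.
            "Args": ["bash", "-c", "exit 1"],
        },
    }
```

It would be submitted as the last step through `add_job_flow_steps()`, the same way as the other steps. The cluster would then end with a failed step, which may matter if anything monitors step failures.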