13

I currently have a task to terminate a long-running EMR cluster after a set period of time (based on some metric). Google Dataproc has this capability in a feature called "Cluster Scheduled Deletion".

Is this possible on EMR natively, maybe using CloudWatch metrics? Or can I write a long-running jar that sits on the EMR master node, polls YARN for some idle-time metric, and shuts down the cluster after a set period of time?

Edit: To clarify, I would like the cluster to be terminated once it has been idle for some x amount of time. For example, if the cluster has been up for a while but no jobs have run for, say, 1 hour and it is just sitting there doing nothing, I'd like the ability to terminate it.

Abdullah Khawer
h0mer
  • Could you please clarify *how* you wish to determine when to terminate? Is it at a certain time, or after *x* hours, or is it after a period of idle time where the cluster is not running any jobs, or some other method? – John Rotenstein Apr 12 '18 at 22:13
  • Added some more clarification to the original post. Let me know if that helps. Basically I'd like to implement the Google Dataproc "Cluster Scheduled Deletion" functionality in EMR in some fashion. – h0mer Apr 12 '18 at 22:38
  • This is an old question but if someone needs an answer to this note that now AWS offers an auto termination mechanism. More at https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-auto-termination-policy.html – bogdan.rusu Jan 06 '23 at 12:34

3 Answers

8

The easiest method would be to use Amazon EMR Metrics and Dimensions for Amazon CloudWatch. There is an IsIdle boolean metric that "indicates that a cluster is no longer performing work".

You could create a CloudWatch Alarm that triggers if the metric has been true for more than x minutes. The alarm would send a message to Amazon SNS, which can trigger a Lambda function to shut down the cluster.

Components:

  • Amazon CloudWatch Alarm
  • Amazon SNS topic
  • AWS Lambda function
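
As a rough sketch of the alarm side of this setup, the helper below builds the arguments for CloudWatch's `put_metric_alarm` call. The function name, the alarm name, and the choice of the `Minimum` statistic are assumptions made for illustration; the namespace, metric name, and `JobFlowId` dimension are the ones EMR publishes.

```python
import math


def build_idle_alarm_params(cluster_id, idle_minutes, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm (hypothetical helper)."""
    period_seconds = 300  # EMR publishes its metrics every five minutes
    periods = max(1, math.ceil(idle_minutes * 60 / period_seconds))
    return {
        "AlarmName": f"emr-idle-{cluster_id}",  # assumed naming scheme
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Minimum",  # assumed: require every sample in the period to be idle
        "Period": period_seconds,
        "EvaluationPeriods": periods,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that triggers the Lambda function
    }
```

The result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`, and the Lambda function subscribed to the topic would then call `boto3.client("emr").terminate_job_flows(JobFlowIds=[cluster_id])`.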

Update: This apparently isn't suitable (see comments below).

An alternate method would be:

  • Use Amazon CloudWatch Events to schedule a Lambda function every x seconds
  • The Lambda function looks for any clusters with a particular tag that indicates how long to wait until shutdown (eg 40 minutes). If the tag is not present, the cluster remains untouched.
  • The Lambda function queries the cluster state (somehow -- probably via a Hadoop API call), then:
    • If the cluster is idle and there is no Idle Since tag, add an Idle Since tag with the current timestamp
    • If the cluster is idle and it has been more than x minutes since the timestamp in the Idle Since tag, terminate the cluster.
    • If the cluster is not idle, remove the Idle Since tag (if present)
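
The tag-based decision in the steps above can be sketched as a small pure function. This is only an illustration: the function name and the action strings are made up here, and how `is_idle` is determined is left to the caller.

```python
from datetime import datetime, timedelta, timezone


def next_action(is_idle, idle_since, now, timeout):
    """Decide what the scheduled Lambda should do for one cluster.

    idle_since is the parsed value of the Idle Since tag, or None if absent.
    Returns one of "add_tag", "remove_tag", "terminate", "none" (hypothetical labels).
    """
    if not is_idle:
        # Cluster is busy again: clear any stale idle marker.
        return "remove_tag" if idle_since is not None else "none"
    if idle_since is None:
        # First time the cluster is seen idle: start the clock.
        return "add_tag"
    if now - idle_since > timeout:
        return "terminate"
    return "none"
```

With boto3, the three actions would map onto the EMR client's `add_tags`, `remove_tags`, and `terminate_job_flows` calls.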
John Rotenstein
  • 2
    So after reading into the CloudWatch alarms (specifically the IsIdle metric), it says that it only checks the cluster once every 5 minutes and that the check only reports whether the cluster is idle at the specific moment it is checked. This does not mean that the cluster was idle for the entire 5 minutes before the check. Some of the jobs we schedule last for only 3-4 minutes, so there is a possibility that the cluster was in use but the IsIdle check still returns true after that job finished. Any ideas? – h0mer Apr 13 '18 at 05:51
  • 3
    Tried this out. This does work. Using the YARN REST API to get the list of jobs and when they ran works. I then just sort by the DTG of the jobs and get the last job that was submitted. Using that and the current DTG, I'm able to kick off another method to shut down the EMR cluster using the AWS EMR SDK API. Thanks for the help. I just wish Amazon would add this functionality built-in like Google does. Because of the costs associated with long-running clusters, it'd make sense to have an idle-timeout function that kills the cluster if idle for more than some x amount of time. – h0mer Apr 14 '18 at 00:03
7

Given the clarification in your question, there are 3 possible ways to do that.

1) Using the AWS CloudWatch metric IsIdle of an EMR cluster. This metric tracks whether a cluster is live but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes. Reference: https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html

2) [Recommended] Using AWS CloudWatch event/rule and AWS Lambda function to check for Idle EMR clusters. You can achieve visibility on the AWS Console level and can easily enable and disable it.

[Recommended] Solution using 2nd Approach

With this need in mind, I have developed a small framework based on the 2nd solution mentioned above. It is an AWS-based solution that uses AWS CloudWatch and an AWS Lambda function, written as a Python script using Boto3, to terminate AWS EMR clusters that have been idle for a specified period of time.

You specify the maximum idle time threshold, and an AWS CloudWatch event/rule triggers an AWS Lambda function that queries all AWS EMR clusters in the WAITING state. For each cluster, it compares the current time with the cluster's ready time if no EMR steps have been added so far, or otherwise with the end time of the cluster's last step. If the threshold has been exceeded, the AWS EMR cluster is terminated, after termination protection is removed if it was enabled. If not, that cluster is skipped.
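
The comparison described above can be illustrated with the sketch below. It is not the code from the linked repository; the function names are invented for this example, and the inputs are assumed to have the shape of the `Cluster` dict returned by boto3's `describe_cluster` and the `Steps` list returned by `list_steps`.

```python
from datetime import datetime, timedelta, timezone


def idle_since(cluster, steps):
    """Return the time the cluster last did work (hypothetical helper).

    Uses the latest step end time if any step has finished,
    otherwise the time the cluster became ready.
    """
    end_times = [
        step["Status"]["Timeline"]["EndDateTime"]
        for step in steps
        if "EndDateTime" in step["Status"].get("Timeline", {})
    ]
    if end_times:
        return max(end_times)
    return cluster["Status"]["Timeline"]["ReadyDateTime"]


def should_terminate(cluster, steps, now, max_idle):
    """True if a WAITING cluster has been idle for longer than max_idle."""
    if cluster["Status"]["State"] != "WAITING":
        return False
    return now - idle_since(cluster, steps) > max_idle
```

When this returns true, the Lambda function would call the EMR client's `set_termination_protection(JobFlowIds=[...], TerminationProtected=False)` and then `terminate_job_flows(JobFlowIds=[...])`.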

AWS CloudWatch event/rule will decide how often AWS Lambda function should check for idle AWS EMR clusters.

You can disable the AWS CloudWatch event/rule at any time to disable this framework in a single click without deleting its AWS CloudFormation stack.

AWS Lambda function is using Python 3.7 as its runtime environment.

You can get the code and use it from GitHub here: https://github.com/abdullahkhawer/auto-terminate-idle-emr

Note: Any contributions, improvements, and suggestions to this solution that I developed will be highly appreciated.

3) Some other custom solution based on a shell script run by a cron job on an EMR cluster's master node. With this approach you lose visibility at the AWS Console level, and you may require SSH access as well.

Abdullah Khawer
1

I had to do a similar implementation, and considering only the cluster's elapsed time did not solve our problem.

So we came up with an approach that calls the Hadoop API, which is documented here:

https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API

Here is what we did:

  1. Ask the user who brings up a cluster to add a tag such as "AutoShutDown":"True:BufferMinutes", where "AutoShutDown" is the key and "True:BufferMinutes" is the value of the tag.

  2. Here BufferMinutes is the time in minutes (30, 60, etc.).

  3. Create a Lambda function that calls the Hadoop API of every cluster tagged as in step 1 (if the user does not add the tag, the cluster is left untouched) and fetches the end time of the last completed job. This applies only if all jobs are completed or terminated; if any job is still running, do nothing and exit.

  4. Then compare the idle time with the requested buffer:

    datetime_difference = current_time - last_finished
    if datetime_difference > requested_time:
        terminate_cluster()

  5. Create a CloudWatch trigger, add the Lambda function as its target, and schedule the trigger to run as often as required.

Note: The Lambda function is written in Python, so boto3 is used and the client is "emr", the same as in abdullahkhawer's solution above.
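
The tag parsing and idle check from the steps above can be sketched as follows. This is a minimal illustration and not the original Lambda function: the function names are invented, and it assumes the job list comes from the YARN ResourceManager's Cluster Applications endpoint (`/ws/v1/cluster/apps`), whose `startedTime` and `finishedTime` fields are milliseconds since the epoch. A cluster that has never run a job is left alone in this sketch.

```python
# YARN application states that mean a job has not finished yet.
ACTIVE_STATES = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED", "RUNNING"}


def parse_auto_shutdown_tag(value):
    """Parse a tag value like "True:30"; return minutes, or None if disabled or malformed."""
    enabled, _, minutes = value.partition(":")
    if enabled.strip().lower() != "true" or not minutes.strip().isdigit():
        return None
    return int(minutes)


def last_finished_ms(apps_response):
    """Return the latest finishedTime, or None if a job is active or none exist."""
    apps = (apps_response.get("apps") or {}).get("app", [])
    if any(app["state"] in ACTIVE_STATES for app in apps):
        return None
    return max((app["finishedTime"] for app in apps), default=None)


def should_terminate(apps_response, buffer_minutes, now_ms):
    """True if the last job finished more than buffer_minutes ago."""
    finished = last_finished_ms(apps_response)
    if finished is None:
        return False
    return now_ms - finished > buffer_minutes * 60_000
```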

This implementation lets each user choose their own idle buffer and takes a great deal of burden off DevOps.

Arun Mohan