
I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where the application is RUNNING but the cluster is actually stuck. I know that if the job doesn't get stuck, it finishes in 5 hours or less; if it's still running after that, it's a sign the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.

Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.

What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up an extra service to monitor the job -- some kind of Spark / YARN / EMR setting would be best.

Note: I've tried using Spark speculation to unblock the stuck job, but that doesn't help.

conradlee
  • I guess you are not using a shell script to deploy your EMR cluster? Otherwise you could just create the cluster, get the ID, wait 5 hours, and kill the cluster if it's still available – Tom Lous Jan 10 '19 at 16:50
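A minimal sketch of this comment's approach, assuming the cluster is launched from a shell script with the AWS CLI (the name, release label, instance types, and step file below are placeholders):

```sh
#!/bin/bash
# Launch the ephemeral cluster and capture its ID (flags are illustrative).
CLUSTER_ID=$(aws emr create-cluster \
  --name "my-spark-job" \
  --release-label emr-5.20.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --steps file://step.json \
  --auto-terminate \
  --query 'ClusterId' --output text)

# Wait out the 5-hour budget, then kill the cluster unconditionally;
# if the step already finished and the cluster auto-terminated,
# terminate-clusters is effectively a no-op.
sleep $((5 * 3600))
aws emr terminate-clusters --cluster-ids "$CLUSTER_ID"
```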

1 Answer


EMR has a Bootstrap Actions feature that lets you run scripts when the cluster is being initialized. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates the cluster after a certain time.
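A bootstrap action is attached when the cluster is created, e.g. with the AWS CLI (the S3 path and script name here are placeholders):

```sh
aws emr create-cluster \
  --name "my-spark-job" \
  --release-label emr-5.20.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --steps file://step.json \
  --auto-terminate \
  --bootstrap-actions Path=s3://my-bucket/max_lifetime.sh,Name="MaxLifetimeWatchdog"
```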

I use a script based on this one for the bootstrap action: https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh

Basically, write a script that checks /proc/uptime to see how long the EC2 machine has been online, and once the uptime surpasses your time limit, send a shutdown command to the cluster.
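A minimal sketch of such a bootstrap script, under a few assumptions: the 6-hour limit (5 hours of expected runtime plus a buffer) and the 5-minute poll interval are placeholders, and termination protection is assumed to be off so that shutting down the master node terminates the cluster:

```sh
#!/bin/bash
# Hypothetical self-termination watchdog, in the spirit of the linked script.
MAX_UPTIME_SECONDS=$((6 * 3600))

# nohup + background so the bootstrap action itself returns quickly
# while the watchdog keeps running for the life of the instance.
nohup bash -c "
  while true; do
    # First field of /proc/uptime is seconds since boot; strip the decimals.
    up=\$(cut -d ' ' -f 1 /proc/uptime | cut -d '.' -f 1)
    if [ \"\$up\" -gt $MAX_UPTIME_SECONDS ]; then
      # Shutting down the master node causes EMR to terminate the cluster
      # (termination protection must be disabled).
      sudo shutdown -h now
    fi
    sleep 300  # re-check every 5 minutes
  done
" >/dev/null 2>&1 &
```

Since the limit here is a fixed wall-clock duration, scheduling `sudo shutdown -h +360` directly from the bootstrap action would be an even simpler variant; the polling loop mainly earns its keep if you want to add extra conditions, as the linked terminate_idle_cluster.sh does.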