
I'm trying to write a component that will start up an EMR cluster, run a Spark pipeline on that cluster, and then shut that cluster down once the pipeline completes.

I've gotten as far as creating the cluster and setting permissions to allow my main cluster's worker machines to start EMR clusters. However, I'm struggling with debugging the created cluster and waiting until the pipeline has concluded. Here is the code I have now. Note I'm using Spark Scala, but this is very close to standard Java code:

val runSparkJob = new StepConfig()
  .withName("Run Pipeline")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(
    new HadoopJarStepConfig()
      .withJar("/path/to/jar")
      .withArgs(
        "spark-submit",
        "etc..."
      )
  )

// Create a cluster and run the Spark job on it
val clusterName = "REDACTED Cluster"
val createClusterRequest =
  new RunJobFlowRequest()
    .withName(clusterName)
    .withReleaseLabel(Configs.EMR_RELEASE_LABEL)
    .withSteps(enableDebugging, runSparkJob)
    .withApplications(new Application().withName("Spark"))
    .withLogUri(Configs.LOG_URI_PREFIX)
    .withServiceRole(Configs.SERVICE_ROLE)
    .withJobFlowRole(Configs.JOB_FLOW_ROLE)
    .withInstances(
      new JobFlowInstancesConfig()
        .withEc2SubnetId(Configs.SUBNET)
        .withInstanceCount(Configs.INSTANCE_COUNT)
        .withKeepJobFlowAliveWhenNoSteps(false)
        .withMasterInstanceType(Configs.MASTER_INSTANCE_TYPE)
        .withSlaveInstanceType(Configs.SLAVE_INSTANCE_TYPE)
    )

val newCluster = emr.runJobFlow(createClusterRequest)
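(The `enableDebugging` step referenced above isn't shown; assuming it uses the SDK's standard helper, it would look something like this untested sketch using `StepFactory`:)

```scala
import com.amazonaws.services.elasticmapreduce.model.{ActionOnFailure, StepConfig}
import com.amazonaws.services.elasticmapreduce.util.StepFactory

// Hypothetical sketch of the redacted enableDebugging step, built with
// the v1 SDK's StepFactory convenience helper.
val enableDebugging = new StepConfig()
  .withName("Enable Debugging")
  .withActionOnFailure(ActionOnFailure.TERMINATE_CLUSTER)
  .withHadoopJarStep(new StepFactory().newEnableDebuggingStep())
```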

I have two concrete questions:

  1. The call to emr.runJobFlow returns immediately after submitting the request. Is there any way I can make it block until the cluster is shut down, or otherwise wait until the workflow has concluded?

  2. My cluster is actually not coming up and when I go to the AWS Console -> EMR -> Events view I see a failure:

    Amazon EMR Cluster j-XXX (REDACTED...) has terminated with errors at 2019-06-13 19:50 UTC with a reason of VALIDATION_ERROR.

Is there any way I can get my hands on this error programmatically in my Java/Scala application?

alexgolec

1 Answer


Yes, it is very possible to wait until an EMR cluster is terminated.

The EMR client provides waiters that will block execution until the cluster (i.e. job flow) reaches a certain state.

val newCluster = emr.runJobFlow(createClusterRequest)

// RunJobFlowResult exposes the id as getJobFlowId; the job flow id
// doubles as the cluster id for DescribeCluster and the waiters.
val describeRequest = new DescribeClusterRequest()
    .withClusterId(newCluster.getJobFlowId)

// Wait until terminated
emr.waiters().clusterTerminated().run(new WaiterParameters(describeRequest))

Also, if you want to get the status of the cluster (i.e. job flow), you can call the describeCluster function of the EMR client. The DescribeCluster result includes state and status information about the cluster, which you can use to determine whether it succeeded or failed.

val result = emr.describeCluster(describeRequest)
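For the second question, the state change reason on that result is where the VALIDATION_ERROR shows up, so it can be read programmatically. Continuing from `result` above, an untested sketch:

```scala
// Untested sketch: inspect why the cluster terminated.
// ClusterStatus carries the state plus a ClusterStateChangeReason.
val status = result.getCluster.getStatus
val reason = status.getStateChangeReason

// For the failure in the question: state "TERMINATED_WITH_ERRORS",
// reason code "VALIDATION_ERROR", plus a human-readable message.
println(s"State: ${status.getState}, code: ${reason.getCode}, message: ${reason.getMessage}")
```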

Note: I'm not the strongest Java developer, so the above is my best guess at how it would work based on the documentation; I have not tested it.

JD D
  • thanks, but this seems to be a bit of functionality in JavaScript, whereas I'm writing in Java/Scala. – alexgolec Jun 14 '19 at 17:39
  • Oopsy, I'll update the answer in a bit to include how to do it in Scala – JD D Jun 14 '19 at 17:41
  • I updated it; the Java SDK has the same type of functionality. I tried to update it for Java but it is not tested and may need to be tweaked, feel free to edit – JD D Jun 14 '19 at 18:01