
I arrived at this question after some experimenting, so here is a brief description of what I am trying to do and how I got here.

I'm writing a script to start an EMR cluster using the AWS SDK for Java. The cluster has to be started inside a VPC, in a subnet with a specific id. When I specify the subnet id (the line in the code below ending with // ******), the EMR cluster stays in the STARTING state for several minutes without making progress, then gives up and fails. I'm not sure whether this is a bug in the SDK's implementation of this functionality.

    try {
        /**
         * Specifying credentials
         */
        String accessKey = EmrUtils.ACCESS_KEY;
        String secretKey = EmrUtils.SECRET_ACCESS_KEY;
        AWSCredentials credentials = new BasicAWSCredentials(accessKey,
            secretKey);

        /**
         * Initializing emr client object
         */
        emrClient = new AmazonElasticMapReduceClient(credentials);

        emrClient.setEndpoint(EmrUtils.ENDPOINT);

        /**
         * Specifying bootstrap actions
         */
        ScriptBootstrapActionConfig scriptBootstrapConfig = new ScriptBootstrapActionConfig();
        scriptBootstrapConfig.setPath("s3://bucket/bootstrapScript.sh");
        BootstrapActionConfig bootstrapActions = new BootstrapActionConfig(
            "Bootstrap Script", scriptBootstrapConfig);


        RunJobFlowRequest jobFlowRequest = new RunJobFlowRequest()
            .withName("Java SDK EMR cluster")
            .withLogUri(EmrUtils.S3_LOG_URI)
            .withAmiVersion(EmrUtils.AMI_VERSION)
            .withBootstrapActions(bootstrapActions)
            .withInstances(
                new JobFlowInstancesConfig()
                    .withEc2KeyName(EmrUtils.EC2_KEY_PAIR)
                    .withHadoopVersion(EmrUtils.HADOOP_VERSION)
                    .withInstanceCount(1)
                    .withEc2SubnetId(EmrUtils.EC2_SUBNET_ID) // ******
                    .withKeepJobFlowAliveWhenNoSteps(true)
                    .withMasterInstanceType(EmrUtils.MASTER_INSTANCE_TYPE)
                    .withTerminationProtected(true)
                    .withSlaveInstanceType(EmrUtils.SLAVE_INSTANCE_TYPE));

        RunJobFlowResult result = emrClient.runJobFlow(jobFlowRequest);
        String jobFlowId = result.getJobFlowId();
        System.out.println(jobFlowId);

    } catch (Exception e) {
        e.printStackTrace();
        // Note: this releases the SDK client's resources; it does not terminate the cluster.
        System.out.println("Shutting down EMR client");
        if (emrClient != null) {
            emrClient.shutdown();
        }
    }

When I do the same thing using the EMR console, the cluster starts, bootstraps, and successfully goes into the WAITING state. Is there any other way to specify the subnet id when starting a cluster? I believe boto allows additional parameters to be sent as a string. I found something similar in Java: .withAdditionalInfo(additionalInfo), a method of RunJobFlowRequest that takes a JSON string as its argument. However, I don't know which key should be used for the EC2 subnet id in that JSON string.

(Using Python boto is not an option for me; I ran into other showstopping issues with it and had to switch to the AWS SDK for Java.)
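
To narrow down where the launch stalls, it may help to poll the cluster state after runJobFlow returns instead of watching the console. The following is only a minimal sketch under stated assumptions: ClusterStateWatcher and waitForState are hypothetical names, not part of the SDK, and the state source is a plain Supplier<String> so the example runs without AWS access. In a real script the supplier would presumably wrap the SDK's DescribeCluster call for the job flow id and return the cluster status state; the status object's state change reason is also worth logging, since it may say why the launch failed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper, not part of the AWS SDK.
class ClusterStateWatcher {

    // EMR cluster states after which no further progress is possible.
    private static final Set<String> TERMINAL_STATES = new HashSet<>(
            Arrays.asList("TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"));

    /**
     * Polls stateSource until it reports targetState.
     * Throws IllegalStateException if the cluster reaches a terminal state
     * or if maxPolls attempts pass without reaching targetState.
     */
    static String waitForState(Supplier<String> stateSource, String targetState,
                               int maxPolls, long pollIntervalMillis) {
        String state = null;
        for (int attempt = 0; attempt < maxPolls; attempt++) {
            state = stateSource.get();
            if (targetState.equals(state)) {
                return state;
            }
            if (TERMINAL_STATES.contains(state)) {
                throw new IllegalStateException("Cluster ended in state " + state);
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted in state " + state, e);
            }
        }
        throw new IllegalStateException(
                "Timed out after " + maxPolls + " polls; last state: " + state);
    }

    public static void main(String[] args) {
        // Assumption: a canned sequence stands in for real DescribeCluster results.
        Iterator<String> fakeStates =
                Arrays.asList("STARTING", "BOOTSTRAPPING", "WAITING").iterator();
        String reached = waitForState(fakeStates::next, "WAITING", 10, 10L);
        System.out.println("Reached state: " + reached);
    }
}
```

With a real state source, a poll interval of tens of seconds is probably more appropriate than the 10 ms used in the demo, and the exception message gives the last observed state (for example STARTING) to report alongside the subnet id.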

gaurav
  • Or, how do I use the .withAdditionalInfo(additionalInfo) method of the RunJobFlowRequest object? – gaurav Jul 07 '14 at 20:57
  • I tried starting an EMR cluster in a newly created VPC, and it worked. The cluster went into the RUNNING and then WAITING states after bootstrapping. I am now looking into what could be wrong with the earlier VPC. – gaurav Jul 08 '14 at 00:23
