Airflow/Amazon EMR: The VPC/subnet configuration was invalid: Subnet is required : The specified instance type m5.xlarge can only be used in a VPC

Question

I want to create an emr cluster triggered via Airflow on Amazon EMR. The emr cluster shows up in the UI of Amazon EMR but with an error saying: "The VPC/subnet configuration was invalid: Subnet is required : The specified instance type m5.xlarge can only be used in a VPC"

Below is the code snippet and the config details in json format for this task that are used in the Airflow script.

My question is how can I incorporate the information (id codes) about VPC and subnet in the json (if this is even possible)? there are no explicit examples out there.

Hint: a network and an EC2 subnet is already created

JOB_FLOW_OVERRIDES = {
    "Name": "sentiment_analysis",
    "ReleaseLabel": "emr-5.33.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}], # We want our EMR cluster to have HDFS and Spark
    "Configurations": [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"}, # by default EMR uses py2, change it to py3
                }
            ],
        }
    ],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",
                "Market": "SPOT",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core - 2",
                "Market": "SPOT", # Spot instances are a "use as available" instances
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False, # this lets us programmatically terminate the cluster
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

create_emr_cluster = EmrCreateJobFlowOperator(
    task_id="create_emr_cluster",
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id="aws_default",
    emr_conn_id="emr_default",
    dag=dag,
)

Have you tried using `'Ec2SubnetId'` within the `Instances` object? Source: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.run_job_flow I see it's a different library from the one you are using but many fields are very similar, so it might be the case that works fine for you as well — Vzzarr, May 20 '21 at 15:57

score 4 · Answer 1 · answered Jul 23 '21 at 12:46

EmrCreateJobFlowOperator calls create_job_flow from emr.py which matches the same api from boto3 emr client.

Therefore you can put an item "Ec2SubnetId" with your subnet id as value within the "Instances" dictionary.

This works for me on Apache Airflow 2.0.2

Airflow/Amazon EMR: The VPC/subnet configuration was invalid: Subnet is required : The specified instance type m5.xlarge can only be used in a VPC

1 Answers1