
I am creating a job to parse massive amounts of server data, and then upload it into a Redshift database.

My job flow is as follows:

  • Grab the log data from S3
  • Either use Spark DataFrames or Spark SQL to parse the data and write it back out to S3 (rough sketch of this step below)
  • Upload the data from S3 to Redshift.
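
My parse step (the second bullet) will look roughly like this in PySpark; the paths and regexes below are just placeholders for the real log format:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="parse-server-logs")
sqlContext = SQLContext(sc)

# Placeholder input path -- the real job reads the raw server logs from S3
raw = sqlContext.read.text("s3://my-log-bucket/raw/")

# Placeholder parse: pull a couple of fields out of each log line with regexes
parsed = raw.select(
    F.regexp_extract("value", r"^(\S+)", 1).alias("host"),
    F.regexp_extract("value", r"\[([^\]]+)\]", 1).alias("timestamp"),
)

# Write back to S3 in a format Redshift's COPY command can load
parsed.write.mode("overwrite").json("s3://my-log-bucket/parsed/")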

I'm getting hung up, though, on how to automate this so that my process spins up an EMR cluster, bootstraps the programs it needs, and runs my Python script containing the parsing and writing code.

Does anyone have any examples, tutorials, or experience they could share with me to help me learn how to do this?

flybonzai
  • There is now a tutorial from AWS themselves https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/ . We ended up throwing away Cloudformation and reused a decent portion of Python/Spark/Livy stuff. – Pranasas Oct 18 '18 at 08:35
  • Hello, I have a similar requirement. How did you approach or solve your problem? – akash sharma Jul 18 '20 at 08:58

4 Answers


Take a look at the boto3 EMR docs to create the cluster. You essentially have to call run_job_flow and create steps that run the program you want.

import boto3    

client = boto3.client('emr', region_name='us-east-1')

S3_BUCKET = 'MyS3Bucket'
S3_KEY = 'spark/main.py'
S3_URI = 's3://{bucket}/{key}'.format(bucket=S3_BUCKET, key=S3_KEY)

# upload file to an S3 bucket
s3 = boto3.resource('s3')
s3.meta.client.upload_file("myfile.py", S3_BUCKET, S3_KEY)

response = client.run_job_flow(
    Name="My Spark Cluster",
    ReleaseLabel='emr-4.6.0',
    Instances={
        'MasterInstanceType': 'm4.xlarge',
        'SlaveInstanceType': 'm4.xlarge',
        'InstanceCount': 4,
        'KeepJobFlowAliveWhenNoSteps': True,
        'TerminationProtected': False,
    },
    Applications=[
        {
            'Name': 'Spark'
        }
    ],
    BootstrapActions=[
        {
            'Name': 'Maximize Spark Default Config',
            'ScriptBootstrapAction': {
                'Path': 's3://support.elasticmapreduce/spark/maximize-spark-default-config',
            }
        },
    ],
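    # Steps run in order: enable debugging, copy the job script from S3 onto the master node, then spark-submit it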
    Steps=[
    {
        'Name': 'Setup Debugging',
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['state-pusher-script']
        }
    },
    {
        'Name': 'setup - copy files',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['aws', 's3', 'cp', S3_URI, '/home/hadoop/']
        }
    },
    {
        'Name': 'Run Spark',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '/home/hadoop/main.py']
        }
    }
    ],
    VisibleToAllUsers=True,
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole'
)

You can also add steps to a running cluster if you know the job flow id:

job_flow_id = response['JobFlowId']
print("Job flow ID:", job_flow_id)

step_response = client.add_job_flow_steps(JobFlowId=job_flow_id, Steps=SomeMoreSteps)

step_ids = step_response['StepIds']

print("Step IDs:", step_ids)
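
SomeMoreSteps is just a placeholder here; it is a list with the same shape as the Steps argument to run_job_flow, for example a hypothetical follow-up job:

SomeMoreSteps = [
    {
        'Name': 'Run another Spark job',
        'ActionOnFailure': 'CANCEL_AND_WAIT',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            # another_job.py is hypothetical -- use any script already copied onto the cluster
            'Args': ['spark-submit', '/home/hadoop/another_job.py']
        }
    }
]

If you need to block until a step finishes (for example, before kicking off the Redshift COPY), boto3 also provides EMR waiters:

# Wait for the last submitted step to complete before moving on
waiter = client.get_waiter('step_complete')
waiter.wait(ClusterId=job_flow_id, StepId=step_ids[-1])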

For more configurations, check out sparksteps.

Kamil Sindi
  • I am not able to understand the value and significance of the S3_KEY value being a Python file. What does it do? – Anandhu Ajayakumar Jan 08 '18 at 10:55
  • The S3 key is the PySpark file / job you want to run. One of the steps copies it from S3 to your cluster. It doesn't have to be a Python file. It could be Scala if you're executing a Scala job. – Kamil Sindi Jan 08 '18 at 13:30
  • This creates a jobflow id but it doesn't show up in the EMR console :-? – CpILL Sep 04 '18 at 13:48
  • The 'ScriptBootstrapAction' mentioned in the script above is/should no longer be needed: see https://github.com/aws-samples/emr-bootstrap-actions/blob/master/spark/README.md – Marco Jan 10 '19 at 17:17

Just do this using AWS Data Pipeline. You can set up your S3 bucket to trigger a Lambda function every time a new file is placed inside the bucket (https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html). Your Lambda function then activates your Data Pipeline (https://aws.amazon.com/blogs/big-data/using-aws-lambda-for-event-driven-data-processing-pipelines/), the Data Pipeline spins up a new EMR cluster using EmrCluster (where you can specify your bootstrap options), you run your EMR commands using EmrActivity, and when it's all done it terminates the EMR cluster and deactivates the Data Pipeline.
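
For the Lambda piece, a minimal sketch (this assumes the pipeline already exists and that its id is passed in through an environment variable; PIPELINE_ID is just a placeholder name):

import os
import boto3

def lambda_handler(event, context):
    # Triggered by the S3 put event; activate the pre-built Data Pipeline
    client = boto3.client('datapipeline')
    client.activate_pipeline(pipelineId=os.environ['PIPELINE_ID'])
    return {'activated': os.environ['PIPELINE_ID']}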

Kyle Bridenstine
  • Tried this. Fails without writing logs :( Another half-baked AWS application that is just a wrapper for Lambda functions :-/ – CpILL Aug 31 '18 at 08:41
  • @CpILL yup, although I promoted AWS Data Pipeline here I did not go with it for my own use. After evaluating it I didn’t think it was robust enough, so I went all in on Apache Airflow. Haven’t looked back since :) but AWS Data Pipeline is ok for small, simple things. I created a few proof-of-concept apps with the architecture listed in my answer and it worked ok after banging my head on the wall for a few hours getting it to work :) – Kyle Bridenstine Aug 31 '18 at 12:52
  • Looks good. Will have a look at it once it leaves incubation. Also, I need an AWS solution for now :-/ – CpILL Sep 03 '18 at 09:56
  • Also I just thought of something. Sometimes the AWS logs just take a little while to be written. For example, when I run Spark commands on an EMR sometimes the command will fail. Well it takes another three minutes or so before the logs get written so I have to wait to see why the command failed. So if this is the case for you when you used the AWS Data Pipeline then just refresh the logs until you see them come in. There are also logs in different places e.g., you'd have your Data Pipeline logs but if you spun up an EMR through Pipeline then those logs would be in S3 elsewhere. – Kyle Bridenstine Sep 04 '18 at 17:23
  • If you still need a solution then I think you should just go with my answer. The only thing you need Data Pipeline to do is spin up the EMR cluster and run your command. That's not complex at all so you shouldn't be burdened by Data Pipeline's lack of features and robustness. I'm almost tempted to say you could do this with just S3, Lambda, and EMR. S3 trigger starts the lambda when a new file comes in, lambda uses boto3 to create a new EMR with your hadoop step (EMR auto terminate set to true). The only thing is if your EMR step fails then you wouldn't know, since the lambda would be shut down. – Kyle Bridenstine Sep 04 '18 at 17:28

Actually, I've gone with AWS Step Functions, which is a state machine wrapper for Lambda functions, so you can use boto3 to start the EMR Spark job using run_job_flow and use describe_cluster to get the status of the cluster. Finally, use a Choice state. So your step functions look something like this (Step Function state types in brackets):

Run job (task) -> Wait for X min (wait) -> Check status (task) -> Branch (choice) [ => back to wait, or => done ]
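
The Check status task is a small Lambda around describe_cluster; a rough sketch, assuming the cluster id returned by run_job_flow is passed along in the state input:

import boto3

def lambda_handler(event, context):
    # event['ClusterId'] is carried over from the state that called run_job_flow
    emr = boto3.client('emr')
    state = emr.describe_cluster(ClusterId=event['ClusterId'])['Cluster']['Status']['State']
    # The Choice state branches on this value (loop back to Wait, or finish)
    return {'ClusterId': event['ClusterId'], 'State': state}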

CpILL
  • So basically, you write a Lambda function which uses boto3 to spin up a cluster and add a step to run the Python script (to do the processing)? – akash sharma Jul 18 '20 at 09:00
  • yeah, step functions or something like Airflow to handle the "orchestration" – CpILL Jul 20 '20 at 21:12

I put a complete example on GitHub that shows how to do all of this with Boto3.

The long-lived cluster example shows how to create and run job steps on a cluster; the steps grab data from a public S3 bucket containing historical Amazon review data, do some PySpark processing on it, and write the output back to an S3 bucket. The full demo:

  • Creates an Amazon S3 bucket and uploads a job script.
  • Creates AWS Identity and Access Management (IAM) roles used by the demo.
  • Creates Amazon Elastic Compute Cloud (Amazon EC2) security groups used by the demo.
  • Creates short-lived and long-lived clusters and runs job steps on them.
  • Terminates clusters and cleans up all resources.
Laren Crawford