
I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode. I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install hadoop or yarn on my local machine?

I've done a fair bit of searching for an answer, but I haven't been able to tell whether what I was reading referred to launching a job from the master node of the cluster or from some arbitrary laptop...

mm_857

2 Answers

1

If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.

I personally scp over any relevant scripts and/or jars, ssh into the master node of the cluster, and then run spark-submit.

You can specify most of the relevant Spark job configuration via spark-submit itself. AWS documents how to configure spark-submit jobs in more detail.

For example:

>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop@${PUBLIC_MASTER_DNS}:  
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
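
The question mentions cluster deploy mode; as far as I know that is just one extra flag on the same command. A sketch, still run from the master node and reusing the placeholder option/script names above:

>> spark-submit --master yarn --deploy-mode cluster --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py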

However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
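
For reference, a rough sketch of what passing Spark configuration at cluster creation time can look like with the AWS CLI (the name, instance sizes, key name, and release label below are just illustrative placeholders, not values from the question):

aws emr create-cluster --name "MyCluster" \
    --release-label emr-5.14.0 \
    --applications Name=Spark \
    --configurations '[{"Classification":"spark-defaults","Properties":{"spark.executor.memory":"4g","spark.executor.cores":"2"}}]' \
    --instance-type m4.large --instance-count 3 \
    --ec2-attributes KeyName=MY_SSH_KEY \
    --use-default-roles

Anything set that way (here via the spark-defaults classification) then applies to every spark-submit on the cluster without being repeated on the command line.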

the-ucalegon
  • Ok, thanks. I was actually hoping it'd be possible to interact with the cluster more directly from my terminal, as it's a bit annoying to have to push and run code manually, but I believe that the only way to do this would be to connect to the cluster via VPN. Anyway, accepting your answer as it's definitely a method that works :-). – mm_857 Jun 12 '18 at 10:09
  • Having to SCP/SSH files over to the cluster and then run them is a security issue. – gallamine Jul 13 '18 at 15:53
  • Hey @gallamine, it's been a while, but I was hoping you could elaborate on the security risk of using SCP/SSH. I'm not an infosec guy at all, so it'd definitely be helpful to learn. – the-ucalegon Jun 13 '19 at 01:33
1

You can set up the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:

aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE

Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/
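
The script referenced in the step has to be in S3 already, and you can poll the step from your laptop as well. A rough sketch using the same hypothetical bucket and IDs as the example above (the step ID is returned by add-steps):

aws s3 cp wordcount.py s3://codelocation/wordcount.py
aws emr describe-step --cluster-id j-xxxxx --step-id s-xxxxx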

Salim