
I have SSH-ed into the Amazon EMR master node and want to submit a Spark job written in Python from the terminal (a simple word-count script and a sample.txt are both on the EMR server). How do I do this, and what's the syntax?

The word_count.py is as follows:

from operator import add
import sys

from pyspark import SparkConf, SparkContext

## Constants
APP_NAME = "HelloWorld of Big Data"

## OTHER FUNCTIONS/CLASSES

def main(sc, filename):
    # Read the input file, split each line into words,
    # and count occurrences with a classic map/reduce
    textRDD = sc.textFile(filename)
    words = textRDD.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
    wordcount = words.reduceByKey(add).collect()
    for word, count in wordcount:
        print(word, count)

if __name__ == "__main__":

    # Configure Spark
    conf = SparkConf().setAppName(APP_NAME)
    conf = conf.setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Credentials for the s3a:// scheme; XXXX/YYYY are placeholders
    # (the fs.s3.* key names do not apply to s3a:// paths)
    sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "XXXX")
    sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "YYYY")
    filename = "s3a://bucket_name/sample.txt"
    # filename = sys.argv[1]
    # Execute main functionality
    main(sc, filename)
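(For concreteness, the kind of invocation being asked about, run from the master node, would be something like the sketch below; the second line assumes the commented-out sys.argv[1] branch is enabled so the input path is passed as an argument.)

spark-submit word_count.py
spark-submit word_count.py s3a://bucket_name/sample.txt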
  • Check this: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html. It lets you submit a Python Spark job [as a step to your EMR cluster] via the terminal (a CLI sketch follows this thread). Otherwise, see how to do it with the boto3 APIs: https://stackoverflow.com/questions/36706512/how-do-you-automate-pyspark-jobs-on-emr-using-boto3-or-otherwise – codinnvrends Jun 17 '20 at 11:15
  • use pyspark shell or spark-submit – Snigdhajyoti Jun 17 '20 at 11:15
  • @Snigdhajyoti, I'm aware that I have to use spark-submit. However, I'm confused about the parameters that follow. – ouila Jun 18 '20 at 07:33
  • On the master node, just run `spark-submit --help`; you will find all the parameters you need. And if you want to tune Spark configs, look in [the docs for configs](https://spark.apache.org/docs/latest/configuration.html) – Snigdhajyoti Jun 18 '20 at 08:35
  • Alright. Also, let's say I want to create a step to submit my job. The "Add Step" option allows me to specify one script to run. So, does that mean that I would need 1 step to execute 1 single .py script? Also, can one add multiple steps? @Snigdhajyoti – ouila Jun 18 '20 at 08:57
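Following up on the step-based route from this thread, a minimal sketch with the AWS CLI, based on the EMR docs linked above (the cluster ID, bucket name, and second script are placeholders): each Spark step wraps one spark-submit, and --steps accepts several entries, so multiple .py scripts mean multiple steps, which can be added in a single call.

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Type=Spark,Name="WordCount",ActionOnFailure=CONTINUE,Args=[s3://bucket_name/word_count.py] \
  Type=Spark,Name="SecondJob",ActionOnFailure=CONTINUE,Args=[s3://bucket_name/another_job.py]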

1 Answer


You can run this command:

spark-submit s3://your_bucket/your_program.py
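If you need to control resources or be explicit about the cluster manager, a sketch with a few common flags (the values here are only illustrative; on EMR the default master is already YARN):

spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 2 \
  --executor-memory 2g \
  s3://your_bucket/your_program.py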

If you need to run the script with Python 3, you can run this command before spark-submit:

export PYSPARK_PYTHON=python3.6

Remember to upload your program to a bucket before running spark-submit.

  • Is it not possible to run the code without storing it in an S3 bucket? – ouila Jun 18 '20 at 07:40
  • Also, do I have to create a step if I want to submit a program that's on the S3 bucket? – ouila Jun 18 '20 at 08:59
  • No, it's not necessary to create a step. You can just store the program in a bucket and then run spark-submit on EMR terminal. I think this is the easier way to achieve this. – Lucas Penna Jun 18 '20 at 12:36
  • Why is it necessary to have the program in a bucket before doing spark-submit? – ouila Jun 19 '20 at 09:01
  • The spark-submit command you suggested is not working for some reason. I used the command on the terminal after SSH-ing into the master node. However, I am getting an EmrFileSystem error. The .py code I want to run is on the S3 bucket. – ouila Jun 25 '20 at 08:40
  • java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found – ouila Jun 29 '20 at 07:23
  • It works without any issue!! – Jugal Panchal Jun 17 '22 at 04:26
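On the question in this thread about skipping the S3 copy: spark-submit also accepts a local path, so a minimal sketch is to pull the script onto the master node first (which also sidesteps reading the script itself through EMRFS):

# copy the script to the master node (or scp it from your machine)
aws s3 cp s3://bucket_name/word_count.py .
# run it from the local filesystem
spark-submit word_count.py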