
EDIT: This question was about how to define parameters for a Python/Jupyter notebook file in order to run spark-submit on an Amazon EMR Spark cluster...

Before: I am sorry for my dumb questions, but I am a newbie and have been stuck on this issue for a couple of days, and there seems to be no good guide on the web. I am following the Udacity Spark course. I created a Spark YARN cluster on Amazon AWS (EMR) with one master and three slaves. I created a Jupyter notebook on top of that (and was able to run it and see output using the PySpark kernel). I connected to the cluster (I guess to the master node) using PuTTY and downloaded the Jupyter notebook to the local machine. However, when I try to run it I consistently get stuck on many types of errors. Currently, I run these commands:

/usr/bin/spark-submit --class "org.apache.spark.examples.SparkPi" --master yarn --deploy-mode cluster ./my-test-emr.ipynb 1>output-my-test-emr.log 2>error-my-test-emr.log
aws s3 cp ./error-my-test-emr.log s3://aws-emr-resources-750982214328-us-east-2/notebooks/e-8TP55R4K894W1BFRTNHUGJ90N/error-my-test-emr.log

I made both the error file and the Jupyter notebook public so you can see them (link). I strongly suspect the --class parameter (I pretty much guessed it, and I have read that it can cause my kind of trouble, but no further information was given). Can anyone explain what it is? Why do we need it? And how can I find out/set the correct value? Further explanation about the JAR would also be helpful - why should I turn my Python program into Java? And how should I do that? Many questions have been asked here about this, but none explains it from the root...

Thanks in advance

Eli Borodach

2 Answers


When you say locally, what version of Spark did you download, and from where?

Generally, when I configure Spark on my laptop, I just run the command below to run the SparkPi example:

spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client $SPARK_HOME/lib/spark-examples.jar 10

Where SPARK_HOME is the folder where you extracted the tarball downloaded from the Spark website.
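Note that in recent Spark 2.x distributions the examples jar lives under examples/jars/ with a version suffix rather than under lib/, so the equivalent command would look something like this (the exact jar name depends on the Spark and Scala versions you downloaded; spark-examples_2.11-2.4.5.jar is just an example for Spark 2.4.5 built against Scala 2.11):

spark-submit --class org.apache.spark.examples.SparkPi --master yarn \
--deploy-mode client $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.5.jar 10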

user2230605

  1. Export your notebook as a .py file.
  2. You do not need to specify --class for a Python script.
  3. You do not need to convert your Python code to Java/Scala.
  4. Once you have your .py file, with some name, say test.py, this will work (see the sketch after this list):
spark-submit --master yarn --deploy-mode cluster ./test.py
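
As a rough sketch of steps 1 and 4 (file and app names here are just examples): the notebook can be exported from the command line with nbconvert,

jupyter nbconvert --to script my-test-emr.ipynb   # writes my-test-emr.py

and the exported script must build its own SparkSession, because spark-submit, unlike the PySpark notebook kernel, does not inject a ready-made spark variable (this is the fix the comment thread below converges on):

# test.py - minimal PySpark script for spark-submit
from pyspark.sql import SparkSession

# The PySpark notebook kernel provides a `spark` object automatically;
# a script run via spark-submit has to create its own session.
spark = SparkSession.builder.appName("my-test-emr").getOrCreate()

df = spark.range(10)   # tiny sanity-check DataFrame
print(df.count())      # prints 10 in the driver/YARN logs

spark.stop()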

srikanth holur
  • It looks like you are on the right path, but still, after running the command /usr/bin/spark-submit --master yarn --deploy-mode cluster ./test-emr.py 1>output-my-test-emr4.log 2>error-my-test-emr4.log I get this log: s3://aws-emr-resources-750982214328-us-east-2/notebooks/e-8TP55R4K894W1BFRTNHUGJ90N/error-my-test-emr4.log and I still see an ugly error message... Do you have any idea what the source of the problem is? – Eli Borodach Jul 13 '20 at 14:38
  • I do not have access to that logfile – srikanth holur Jul 13 '20 at 14:52
  • I am still getting `access denied` – srikanth holur Jul 14 '20 at 17:00
  • Do you mind putting it in google drive and sharing it? – srikanth holur Jul 15 '20 at 12:02
  • Sure: https://drive.google.com/file/d/1VbU4QSWds541EdeOeSWr3JYWc0HonLn3/view?usp=sharing – Eli Borodach Jul 15 '20 at 13:09
  • It doesn't have all the logs. From the EMR master node, run this command and you can see the application logs: `yarn logs -applicationId application_1594645488041_0004`. If you cannot figure out the issue from that, share the output with me. – srikanth holur Jul 15 '20 at 21:59
  • I shut down my EMR cluster from time to time when I am not using it, to save money. However, this did shed a lot of light on the issue: when I look in the logs, they end by complaining that there is no spark module. When I ran the Jupyter notebook I did it with the PySpark kernel; is there any parameter for spark-submit where I can choose the PySpark kernel? – Eli Borodach Jul 16 '20 at 14:04
  • While creating your EMR cluster, you have to select the `spark` service. That is all you need. – srikanth holur Jul 16 '20 at 14:07
  • I don't see any "service" category when creating the cluster; can you give a more detailed description? – Eli Borodach Jul 16 '20 at 16:00
  • My bad, it's `Applications` – srikanth holur Jul 16 '20 at 16:06
  • If you meant: Spark: Spark 2.4.5 on Hadoop 2.8.5 YARN and Zeppelin 0.8.2 - then that is indeed my choice – Eli Borodach Jul 16 '20 at 18:51
  • Is it possible to share some sample code of how you are creating your Spark session, and the application logs? – srikanth holur Jul 16 '20 at 18:55
  • Sure: https://drive.google.com/drive/folders/1fLjkIkcFsgHUXHAk1IV6MGayWQiWGcLT?usp=sharing I am beginning to get tired of this medium. Is it possible to PM you? – Eli Borodach Jul 17 '20 at 09:47
  • Sorry, I didn't know there is no PM on Stack Overflow... – Eli Borodach Jul 17 '20 at 10:34
  • You do not have `spark` defined in your code. Add `spark = SparkSession.builder.getOrCreate()` below the imports and it should work. – srikanth holur Jul 17 '20 at 13:20
  • Even after I added it, it still gave me a lot of errors, but finally I managed to push it through somehow (it had worked in the Jupyter notebook, though). I think that happened because I used the PySpark kernel, as I mentioned before. Anyhow, I will edit my question, and you have my deepest thanks from the bottom of my heart – Eli Borodach Jul 17 '20 at 16:04
  • Can you run a notebook in cluster mode with EMR Notebooks? – Cristián Vargas Acevedo Jan 05 '22 at 00:19