
I am relatively new to PySpark and have inherited a data pipeline built on Spark. There is a main server that I connect to, and from a terminal I launch the Spark job with spark-submit, which runs on YARN in cluster deploy mode.

Here is the command I use to kick off the process:

spark-submit --master yarn --num-executors 8 --executor-cores 3 --executor-memory 6g --name program_1 --deploy-mode cluster /home/hadoop/data-server/conf/blah/spark/1_program.py

The process works great, but I am very interested in setting up a Python/Jupyter notebook to execute commands in a similarly distributed manner. I can get a Spark session working in the notebook, but I can't get it to run on YARN across the cluster: the work just runs on a single instance and is very slow. I also tried launching the Jupyter notebook with a configuration similar to my spark-submit command, but that failed.
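
To be concrete, this is roughly the kind of session setup I have been trying in the notebook (the app name is a placeholder, and the resource numbers are just copied from the spark-submit command above):

from pyspark.sql import SparkSession

# Build a session that I hoped would run against YARN instead of locally.
spark = (SparkSession.builder
         .appName("notebook_test")                     # placeholder app name
         .master("yarn")
         .config("spark.submit.deployMode", "client")  # an interactive notebook driver has to stay in client mode
         .config("spark.executor.instances", "8")
         .config("spark.executor.cores", "3")
         .config("spark.executor.memory", "6g")
         .getOrCreate())

# Trivial sanity check: this should fan out across executors
# if the session is really running on the cluster.
print(spark.sparkContext.parallelize(range(1000), 24).count())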

I have been reading a few blog posts about launching a Jupyter notebook with the same configuration I pass to spark-submit, but my attempts are not working; one of the approaches I tried is sketched below.
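
The approach those posts describe is to have the pyspark shell launch Jupyter as its driver, passing the same resource options I use with spark-submit (as far as I can tell, cluster deploy mode is not available for the interactive shell, so this assumes client mode):

# Tell pyspark to start Jupyter as the driver process.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

# Launch with the same resources as my spark-submit job, but in client deploy mode.
pyspark --master yarn --deploy-mode client \
        --num-executors 8 --executor-cores 3 --executor-memory 6g

The notebook starts, but the work still appears to run on a single instance, as described above.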

I wanted to see if anyone can help me run Python with distributed Spark, and/or help me find the right way to launch a Jupyter notebook with the same configuration as my spark-submit command.

My Python version is 2.7 and my Spark version is 2.2.1.
