
I am creating a PySpark application which is modular in nature. My code structure looks like this:

├── main.py
├── src
│   ├── __init__.py
│   ├── jobs
│   │   ├── __init__.py
│   │   └── logic.py
│   └── utils
│       ├── __init__.py
│       └── utility.py

My entry-point script is main.py, which in turn calls the logic function in the logic.py file.
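
For reference, my main.py looks roughly like this (run_job is just an illustrative name for that function):

from pyspark.sql import SparkSession
from src.jobs.logic import run_job  # illustrative name for the function in logic.py

if __name__ == "__main__":
    spark = SparkSession.builder.appName("my_app").getOrCreate()
    run_job(spark)
    spark.stop()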

I am running my project with `spark-submit main.py`.

My question is: do I need to mention the other .py files in the spark-submit command, or do they get imported automatically?

I came across a post which mentions zipping the src folder and passing it as an argument to --py-files.

Which is the right way?

Should I keep the current structure and run the code from main.py like I do now?

Is there any difference between these two ways (logically and performance-wise)?

Jugraj Singh

2 Answers


When running locally there is no need to pass additional modules as a zip with the --py-files flag: your code is local, and so are the master and the workers, so they all have access to your code and the modules it needs.

However, when you want to submit a job to a cluster, the master and the workers need to have access to your main.py file along with all the modules it uses. By passing the --py-files argument you specify the location of those extra modules, so both the master and the workers have access to every part of the code that needs to be run. If you just run spark-submit main.py against a cluster it won't work, because 1) the location of main.py is relative to your local system, so the cluster won't be able to locate it, and 2) the extra modules won't be shipped, so main.py will fail with ImportErrors.

Note: this flag goes before main.py in the command, and the zipped files (as well as main.py itself) need to be somewhere accessible to the whole cluster, not local to your machine, e.g. on an HTTP or FTP server. For example, to submit to a cluster through Mesos:

spark-submit --master mesos://path/to/service/spark --deploy-mode cluster --py-files http://somedomainforfileserving/src.zip  http://somedomainforfileserving/main.py
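
As an alternative (or in addition) to the --py-files flag, you can ship the zip programmatically from inside main.py with SparkContext.addPyFile; a minimal sketch, reusing the URL from the example above (run_job is an illustrative name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_app").getOrCreate()

# Ships the zipped package to the executors and adds it to the driver's sys.path.
spark.sparkContext.addPyFile("http://somedomainforfileserving/src.zip")

# Import only *after* addPyFile, otherwise the package is not on the path yet.
from src.jobs.logic import run_job  # illustrative name
run_job(spark)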

Edit: As for jar dependencies, e.g. the Elasticsearch connector, you can put the jars inside src, e.g. in src/jars, so that they get zipped and distributed to all nodes, and then, when submitting to your cluster, reference the jar by its path relative to src. E.g.:

spark-submit --master mesos://path/to/service/spark --deploy-mode cluster --jars src/jars/elasticsearch-spark-someversion.jar --py-files http://somedomainforfileserving/src.zip  http://somedomainforfileserving/main.py
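
Once the connector jar is available through --jars, main.py can use it like any other data source. A rough sketch (the host and index names are placeholders, and the exact options depend on the connector version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_app").getOrCreate()

# Read from Elasticsearch through the connector jar shipped with --jars.
# "es-host" and "my-index" are placeholders for your own cluster and index.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host")
      .load("my-index"))
df.show()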
mkaran
  • Do I zip the **whole** project or just the `src` part? And can I submit multiple zip files, e.g. if instead of `src` I have two modules `jobs` and `utils` outside the src folder? – Jugraj Singh Nov 08 '17 at 11:21
  • @JugrajSingh It will work both ways, but zipping just the `src` seems the most logical thing to do (`main.py` within the `zip` will be ignored either way). – mkaran Nov 08 '17 at 11:26
  • Are there any good resources for production-grade PySpark projects that can answer all related questions? – Jugraj Singh Nov 08 '17 at 11:27
  • Is there a particular structure for a PySpark project, e.g. where to specify jar dependencies and configuration files, or is it all down to one's own comfort? – Jugraj Singh Nov 08 '17 at 11:30
  • @JugrajSingh Unfortunately I haven't found any good resources for production-grade Spark deployment so far; I'm learning all this stuff in the usual, painful way: `research ... try ... catch ... repeat`. Databricks and the Apache Spark documentation have the best resources I've seen up to now, along with SO. I will post if I find anything of great use. Good luck! – mkaran Nov 08 '17 at 11:37
  • @JugrajSingh Added an edit for the jar dependencies. In general, make sure the master and the workers have access to whatever your code needs to run. – mkaran Nov 08 '17 at 11:44

Yes, zipping your project and then submitting it will work.
Move to your project folder and run `zip -r myproject.zip .`.
Now you can run `spark-submit --py-files myproject.zip main.py` in a terminal.
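
If you prefer to build the archive from Python instead of the zip CLI, the standard library can do the same thing; a small sketch (the archive name is arbitrary):

import shutil

# Equivalent of `zip -r myproject.zip .` run from the project root:
# packs the current directory (including src/) into myproject.zip.
shutil.make_archive("myproject", "zip", ".")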