
I have used poetry to create a wheel file. I am running the following spark-submit command, but it is not working. I think I am missing something:

spark-submit --py-files /path/to/wheel

Please note that I have already referred to the question below as well, but did not get much detail from it, as I am new to Python: how to pass python package to spark job and invoke main file from package with arguments
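For reference, a minimal sketch of the kind of poetry project layout this assumes (all names here are hypothetical, used only to make the discussion concrete):

my_project/
    pyproject.toml
    my_package/
        __init__.py
        job.py            <- Spark transformations live here

Running `poetry build` in such a project would produce a wheel such as `dist/my_package-0.1.0-py3-none-any.whl`, which is the file being passed to `--py-files` above.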

  • You still need to add another parameter for which script to invoke with the main function – OneCricketeer Oct 23 '20 at 15:47
  • Yeah, I added another parameter. It was `spark-submit --py-files wheelfile driver.py`. This driver was calling the function inside the wheel file. But then this driver and the wheel are essentially in the same location. What is the use of the wheel then? Because if I run the command as `spark-submit driver.py`, then it's the same, right? – Sachit Murarka Oct 23 '20 at 15:56
  • Not sure what you mean by "location". Spark applications are distributed. The `--py-files` argument will serialize those files and distribute them into the cluster. Then the `driver.py` file tells which module from those files to execute. Sure, they need to all be on the `PYTHONPATH`, but that's not specific to Spark – OneCricketeer Oct 23 '20 at 16:42
  • Location refers to the same path (mounted path). What will be the difference between these two approaches: `spark-submit driver.py` vs. `spark-submit --py-files wheelfile driver.py`? – Sachit Murarka Oct 23 '20 at 17:14
  • Assuming `driver.py` will import some module that is contained in the wheel/egg/zip that is uploaded, then the first option will be unable to find that code – OneCricketeer Oct 23 '20 at 17:47
  • Actually, I have created the wheel out of my main project itself. I am not sure if I am doing it right or not. I created a poetry package first, then wrote the Spark code inside the package, then ran `poetry build`, which created the wheel file, and then ran this command. I have observed that even when I pass this wheel file in `--py-files`, it is not taking the source code from the wheel file; instead, it is taking it from the package folder. – Sachit Murarka Oct 23 '20 at 17:54
  • I assume you're using `local` as your Spark master? Not a cluster/remote machine? – OneCricketeer Oct 23 '20 at 17:55
  • Yes, for testing I am using it locally. So basically, in distributed mode it will refer to the wheel file to read the transformations in the source code, instead of reading the code from the package folder. Is that correct? – Sachit Murarka Oct 23 '20 at 17:58
  • Does poetry only build the wheel, or does it also fully install it into the Python environment? That might be what you're seeing – OneCricketeer Oct 23 '20 at 18:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/223524/discussion-between-sachit-murarka-and-onecricketeer). – Sachit Murarka Oct 23 '20 at 18:01
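To make the discussion above concrete, here is a minimal sketch of a `driver.py` along the lines described in the comments. The package name `my_package` and the `run()` function are placeholders for illustration, not something taken from the question:

# driver.py -- the main application script passed to spark-submit.
# The wheel supplied via --py-files is added to the PYTHONPATH of the
# driver and the executors, so the package inside it can be imported here.
from pyspark.sql import SparkSession
from my_package.job import run   # hypothetical module shipped inside the wheel

if __name__ == "__main__":
    spark = SparkSession.builder.appName("wheel-example").getOrCreate()
    run(spark)   # the actual transformations are defined in the wheel
    spark.stop()

It would then be submitted with something like `spark-submit --py-files dist/my_package-0.1.0-py3-none-any.whl driver.py`.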

1 Answer


A wheel file can be passed as part of the below spark-submit command:

spark-submit --deploy-mode cluster --py-files /path/to/wheel main_file.py
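One way to check whether the code is really picked up from the distributed wheel rather than from a local package folder (the concern raised in the comments under the question) is a quick check along these lines inside the main script; `my_package` is again a hypothetical placeholder:

# main_file.py -- sanity check that imports resolve to the shipped wheel
import my_package                # assumed package name inside the wheel
print(my_package.__file__)       # when the wheel from --py-files is used,
                                 # this should point inside the .whl archive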

  • I don't think this is correct. From the documentation (https://spark.apache.org/docs/latest/configuration.html) about `spark.submit.pyFiles`: `Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps`. Since they basically do the same thing, I don't expect it to handle wheel files. – ciurlaro Feb 09 '21 at 10:58
  • Wheel files work fine. I have tried it and tested it; it is working fine. – Sachit Murarka Feb 10 '21 at 17:40
  • Then they really should update the documentation :) – ciurlaro Feb 11 '21 at 12:10
  • @SachitMurarka how did you test it exactly? I can't make it work. – godot Sep 26 '21 at 09:14
  • @SachitMurarka is `main_file.py` something inside your `.whl` file or a separate Python file? – Davos Oct 19 '21 at 13:28
  • Never mind, I see it is a separate file. I think @CesareIurlaro is technically correct: whl files are zipimport-compatible, but using `--py-files` will not install the wheel. If your wheel is just a zip archive with no install required, it will work; but if your wheel requires/expects to be installed (for example, if there are C/C++ dependencies in the wheel), it won't work. See https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html – Davos Oct 19 '21 at 15:14
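As a hedged follow-up to the last comment: for a wheel that genuinely needs to be installed (for example because it contains compiled extensions), one alternative, not taken from this thread, is to install the wheel into the Python environment used by the driver and executors beforehand, e.g. `pip install dist/my_package-0.1.0-py3-none-any.whl` (paths and names are illustrative), and then submit with a plain `spark-submit main_file.py`; the linked Databricks post covers broader options for managing such dependencies.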