I must be making a wrong assumption, because I can't find a way to solve this. I want to spark-submit an .egg file; the call should be:
spark-submit --py-files mypkg.egg main.py argv1 argv2
and it should only need the .egg file. But when I execute this, I get:
python: can't open file 'C:/Users/israel/Desktop/spark_python_maven/main.py': [Errno 2] No such file or directory
After researching, I set up a folder structure with an __init__.py file and passed path/to/main.py to spark-submit; then the error changed to:
File "C:/Users/israel/Desktop/spark_python_maven/src/mypkg/main.py", line 3, in <module>
from src.mypkg.data_layer import *
ModuleNotFoundError: No module named 'src'
The only way I found to make this work is to remove the __init__.py and pass the paths to main.py and the other .py files, so I need both the .egg and the .py files, when I expected to only need the .egg file.
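The call that ends up working looks roughly like this (the file list here is illustrative; data_layer.py is the module that main.py imports):
spark-submit --py-files mypkg.egg,src/mypkg/data_layer.py src/mypkg/main.py argv1 argv2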
My main.py looks like:
import os
import sys
from src.mypkg.data_layer import *
from pyspark.sql import SparkSession
if __name__ == "__main__":
    print("hello world!")
My setup.py looks like:
from setuptools import setup, find_packages
setup(
name='mypkg',
version='0.1',
packages = find_packages('src'), # include all packages under src
package_dir = {'':'src'} # tell distutils packages are under src
)
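The .egg itself can be built with setuptools from the folder that contains setup.py:
python setup.py bdist_egg
which leaves the file under dist/ (the generated name includes the package version and the Python version).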
My __init__.py looks like:
from src.mypkg import *
from .mypkg import *
My project structure looks like:
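In text form it is roughly:
spark_python_maven/
    setup.py
    src/
        __init__.py
        mypkg/
            main.py
            data_layer.py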
Where setup.py is at the same level as src, and __init__.py is at the same level as mypkg.
What I want to do is create an .egg file that contains everything I need to spark-submit with command-line arguments. What am I doing wrong?
Thanks in advance.