I have a Python project whose folder has the following structure:

    main_directory - lib - lib.py
                   - run - script.py

script.py is:

    from pyspark.sql import SparkSession
    from lib.lib import add_two

    spark = SparkSession \
        .builder \
        .master('yarn') \
        .appName('script') \
        .getOrCreate()

    print(add_two(1,2))

and lib.py is:

    def add_two(x,y):
        return x+y

I want to launch this as a Dataproc job on GCP. I have looked around online, but I have not really understood how to do it. I am trying to launch the script with

    gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
        run/script.py

But I receive the following error message:

    from lib.lib import add_two
    ModuleNotFoundError: No module named 'lib.lib'
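
A small diagnostic I could drop into script.py to see what actually ends up next to the driver (just a sketch, driver side only; I am assuming the job's working directory is where Dataproc stages the submitted file):

    import os
    import sys

    # Show the Python search path and the contents of the job's working
    # directory, to check whether the lib package was shipped at all.
    print(sys.path)
    print(os.listdir(os.getcwd()))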

Could you help me understand how I should launch the job on Dataproc? The only way I have found so far is to drop the package path from the import, changing script.py to

    from lib import add_two

and then launching the job as

    gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
        --files /lib/lib.py \
        /run/script.py
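
If I understand the docs correctly, the same workaround also works with gcloud's own --py-files flag instead of --files (I have not tested this variant):

    gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
        --py-files lib/lib.py \
        run/script.py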

However, I would like to avoid the tedious process of listing the files manually every time.

Following @Igor's suggestion to pack everything into a zip file, I have found that

    zip -j --update -r libpack.zip /projectfolder/* && spark-submit --py-files libpack.zip /projectfolder/run/script.py

works. However, since -j junks the directory paths, this puts all the files in the root folder of libpack.zip, so if there were files with the same name in different subfolders this would not work.
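
For reference, this is the kind of packaging I think should keep the folder structure: zip from inside main_directory (dropping -j) and add an empty lib/__init__.py so the package can be imported from the zip. I have not verified this end to end on Dataproc, so the exact commands are only a sketch:

    cd main_directory
    touch lib/__init__.py
    zip -r libpack.zip lib
    gcloud dataproc jobs submit pyspark --cluster=$CLUSTER_NAME --region=$REGION \
        --py-files libpack.zip \
        run/script.py

(keeping the original from lib.lib import add_two in script.py).
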
Any suggestions?