I am trying to run a Spark job on a Dataproc cluster in GCP. All of my source code is zipped and stored in a GCS bucket, and the main Python file and the additional jars are in the same GCS bucket.
Now, when I run spark-submit, the main Python file and the jars are copied to the cluster, but the source code archive (.zip) is not.
Here is the spark-submit command I am using:

```sh
gcloud dataproc jobs submit pyspark gs://gcs-bucket/spark-submit/main_file.py \
--project XYZ-data \
--cluster=ABC-v1 \
--region=us-central1 \
--jars gs://qc-dmart/tmp/gcs-connector-hadoop3-2.2.2-shaded.jar,gs://qc-dmart/tmp/spark-bigquery-with-dependencies_2.12-0.24.2.jar \
--archives gs://gcs-bucket/spark-submit/src/pyfiles.zip \
-- /bin/sh -c "gsutil cp gs://gcs-bucket/spark-submit/src/pyfiles.zip . && unzip -n pyfiles.zip && chmod +x" \
-- --config-path=../configs env=dev
```
Here is what I have tried:
- the --archives and --files arguments separately, but with no luck (a sketch of the --files variant is shown below)
- additionally, based on a StackOverflow answer, copying the files directly with gsutil; you can see how I am doing this in the command above
None of these attempts has been fruitful.
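For completeness, the --files variant I tried looked roughly like this; it only swaps the flag, everything else is unchanged, and the zip still never shows up next to the driver:

```sh
# Same submit command as above, but shipping the zip with --files instead of --archives
# (a sketch of the variant I tried; it fails in the same way).
gcloud dataproc jobs submit pyspark gs://gcs-bucket/spark-submit/main_file.py \
--project XYZ-data \
--cluster=ABC-v1 \
--region=us-central1 \
--jars gs://qc-dmart/tmp/gcs-connector-hadoop3-2.2.2-shaded.jar,gs://qc-dmart/tmp/spark-bigquery-with-dependencies_2.12-0.24.2.jar \
--files gs://gcs-bucket/spark-submit/src/pyfiles.zip \
-- --config-path=../configs env=dev
```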
Here is the error thrown from the main Python file:
File "/tmp/b1f7408ed1444754909e368cc1dba47f/promo_roi.py", line 10, in <module>
from src.promo_roi.compute.spark.context import SparkContext
ModuleNotFoundError: No module named 'src'
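For context, that import can only resolve if the zip ends up on the Python path of the driver/executors with the src package at its root. As an illustration, this is the layout the import implies, not a dump of my actual archive:

```sh
# Listing the archive is one way to check what actually sits at its top level.
# The failing import implies a layout roughly like this (assumed, for illustration):
#   src/__init__.py
#   src/promo_roi/__init__.py
#   src/promo_roi/compute/__init__.py
#   src/promo_roi/compute/spark/__init__.py
#   src/promo_roi/compute/spark/context.py   # defines SparkContext
unzip -l pyfiles.zip
```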
Any help would be really appreciated.