I am trying to run a Spark job on a Dataproc cluster in GCP. All of my source code is zipped and stored in a GCS bucket, and the main Python file and the additional jars are in the same GCS bucket.
Now, when I run spark-submit, the main Python file and the jars are copied to the cluster, but the source code archive (.zip) is not.
Here is the spark-submit command I am using:

```sh
gcloud dataproc jobs submit pyspark gs://gcs-bucket/spark-submit/main_file.py \
--project XYZ-data \
--cluster=ABC-v1 \
--region=us-central1 \
--jars gs://qc-dmart/tmp/gcs-connector-hadoop3-2.2.2-shaded.jar,gs://qc-dmart/tmp/spark-bigquery-with-dependencies_2.12-0.24.2.jar \
--archives gs://gcs-bucket/spark-submit/src/pyfiles.zip \
-- /bin/sh -c "gsutil cp gs://gcs-bucket/spark-submit/src/pyfiles.zip . && unzip -n pyfiles.zip && chmod +x" \
-- --config-path=../configs env=dev
```
Here is what I have tried:
- the --archives and --files arguments separately, but with no luck (a sketch of the --files variant is shown below)
- additionally, based on a StackOverflow answer, copying the files directly with gsutil; you can see how I am doing this in the command above
None of these attempts has been fruitful.
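For completeness, the --files variant I tried looked roughly like this; it only swaps the flag, everything else is unchanged, and the zip still never shows up next to the driver:

```sh
# Same submit command as above, but shipping the zip with --files instead of --archives
# (a sketch of the variant I tried; it fails in the same way).
gcloud dataproc jobs submit pyspark gs://gcs-bucket/spark-submit/main_file.py \
--project XYZ-data \
--cluster=ABC-v1 \
--region=us-central1 \
--jars gs://qc-dmart/tmp/gcs-connector-hadoop3-2.2.2-shaded.jar,gs://qc-dmart/tmp/spark-bigquery-with-dependencies_2.12-0.24.2.jar \
--files gs://gcs-bucket/spark-submit/src/pyfiles.zip \
-- --config-path=../configs env=dev
```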
Here is the error thrown from the main Python file:
File "/tmp/b1f7408ed1444754909e368cc1dba47f/promo_roi.py", line 10, in <module>
from src.promo_roi.compute.spark.context import SparkContext
ModuleNotFoundError: No module named 'src'
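For context, that import can only resolve if the zip ends up on the Python path of the driver/executors with the src package at its root. As an illustration, this is the layout the import implies, not a dump of my actual archive:

```sh
# Listing the archive is one way to check what actually sits at its top level.
# The failing import implies a layout roughly like this (assumed, for illustration):
#   src/__init__.py
#   src/promo_roi/__init__.py
#   src/promo_roi/compute/__init__.py
#   src/promo_roi/compute/spark/__init__.py
#   src/promo_roi/compute/spark/context.py   # defines SparkContext
unzip -l pyfiles.zip
```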
Any help would be really appreciated.