I run my PySpark code on Dataproc (serverless batches) using the command below:
gcloud dataproc batches submit pyspark src/features/spark_script.py \
--project=$PROJECT_ID \
--region=$REGION \
--deps-bucket=$DEPS_BUCKET \
--container-image=$CONTAINER_IMAGE \
--service-account=$SERVICE_ACCOUNT \
--subnet=$SUBNETWORK_URI \
--properties deployMode=cluster \
--py-files dist/src.zip \
--files pyproject.toml \
--archives dist/resources.zip
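One variant I still want to try: Spark's spark.archives setting accepts an optional #alias suffix that pins the directory an archive is extracted into (file.zip#directory). Assuming gcloud forwards --archives to spark.archives unchanged, only the last flag would change (the #resources alias is my addition, not something I have confirmed works on Dataproc batches):

gcloud dataproc batches submit pyspark src/features/spark_script.py \
    ... same flags as above ... \
    --archives dist/resources.zip#resources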
Within my code, I use the following to read a YAML file that is packaged inside resources.zip:
import yaml

with open("./resources/config/configuration.yaml", "r") as f:
    YAML_CONFIG = yaml.load(f, yaml.SafeLoader)
Still, it gives me
FileNotFoundError: [Errno 2] No such file or directory: './resources/config/configuration.yaml'
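To narrow this down, my plan is to log the driver's working directory from inside the job and see where (or whether) the archive actually gets extracted. A minimal debugging sketch:

import os

# Print the working directory and its contents so the driver log
# shows where resources.zip ends up.
print("cwd:", os.getcwd())
for root, dirs, files in os.walk("."):
    if root.count(os.sep) >= 2:
        dirs[:] = []  # prune below two levels to keep the log readable
    for name in files:
        print(os.path.join(root, name))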
I also tried setting deployMode=cluster as suggested here, but even that didn't help.
It works only when I pass the file through the --files argument, but I have many config files like this to load in the code, so --archives is the better option for me.
How can I effectively debug and run this?
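For context, this is the fallback lookup I'm considering until I understand the extraction layout. The candidate paths are guesses based on common Spark behaviour (unpacking either into the working directory or into a directory named after the archive); I have not confirmed them for Dataproc batches:

import os
import yaml
from pyspark import SparkFiles  # getRootDirectory() needs an active SparkContext

# Candidate locations, depending on how the archive is extracted:
#   ./resources/...      if the zip root is the resources/ directory itself
#   ./resources.zip/...  if Spark unpacks into a directory named after the archive
#   SparkFiles root      for single files shipped with --files
candidates = [
    "./resources/config/configuration.yaml",
    "./resources.zip/config/configuration.yaml",
    os.path.join(SparkFiles.getRootDirectory(), "configuration.yaml"),
]

# Use the first path that actually exists on this node.
config_path = next((p for p in candidates if os.path.exists(p)), None)
if config_path is None:
    raise FileNotFoundError(f"configuration.yaml not found; tried: {candidates}")

with open(config_path, "r") as f:
    YAML_CONFIG = yaml.load(f, yaml.SafeLoader)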