
Below is my Dataproc job submit command. I pass the project artifacts as a zip file to the "--files" flag:

gcloud dataproc jobs submit pyspark --cluster=test_cluster --region us-central1 gs://test-gcp-project-123/main.py --files=gs://test-gcp-project-123/app_code_v2.zip

Following are the contents of "app_code_v2.zip": the project's Python modules plus the YAML config files. [screenshot of the zip contents]

I'm able to add "app_code_v2.zip" to the path using the code snippet below and access the Python modules, but how do I access the YAML files present in the zip package? Those YAML files contain the configs. Should I explicitly unzip the archive and copy it to the working directory of the master node (roughly what the sketch after the snippet below does)? Is there a better way to handle this?

import os
import sys
if os.path.exists('app_code_v2.zip'):
    sys.path.insert(0, 'app_code_v2.zip')  # make modules inside the zip importable
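
For reference, the explicit unzip-and-read approach mentioned above would look roughly like the sketch below; PyYAML (yaml) and the configs/app.yml path inside the archive are assumptions for illustration, not part of the actual project.

import zipfile
import yaml  # PyYAML, assumed to be available on the cluster

# Unpack the archive into the job's current working directory
with zipfile.ZipFile('app_code_v2.zip') as zf:
    zf.extractall('app_code_v2')

# 'configs/app.yml' is a hypothetical config file inside the zip
with open('app_code_v2/configs/app.yml') as f:
    config = yaml.safe_load(f)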
saravana ir
  • You could write Python code to read from the YAML file, or the dirty way to do it is to unzip it. You can also check --propertiesFile and this link: https://stackoverflow.com/questions/65421933/submit-a-pyspark-job-with-a-config-file-on-dataproc – Subash Jul 25 '22 at 08:26
  • @Subash - I have more than 50 config files, so passing them one by one doesn't seem nice. So I zip them and pass the archive as an --archives argument. I was expecting it to get automatically extracted to the working directory of each executor. – Tom J Muthirenthi May 11 '23 at 12:47

1 Answer


You might want to either 1) extract the YAML files first and add them explicitly to the flag, like --files=<zip>,<yaml>,..., or 2) use --archives=<zip>, which gets automatically extracted into the executors' working directories. Either way, you can get the actual path of the file with SparkFiles.get(filename). See more info on the flags in this doc.
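
A minimal sketch of option 2, assuming PyYAML is installed on the workers and that the archive holds a hypothetical configs/app.yml; the submit command mirrors the one from the question, with a #app_code alias added so the extraction directory has a predictable name:

# Submit with --archives so the zip is unpacked on the executors, e.g.:
#   gcloud dataproc jobs submit pyspark gs://test-gcp-project-123/main.py \
#       --cluster=test_cluster --region=us-central1 \
#       --archives=gs://test-gcp-project-123/app_code_v2.zip#app_code
from pyspark.sql import SparkSession
import yaml  # PyYAML, assumed to be installed on the cluster

spark = SparkSession.builder.getOrCreate()

def load_config(_):
    # The archive is unpacked under its alias ('app_code') inside each
    # executor's working directory; 'configs/app.yml' is hypothetical.
    with open('app_code/configs/app.yml') as f:
        return [yaml.safe_load(f)]

# Run the read on an executor to show where the extracted files land
print(spark.sparkContext.parallelize([0], numSlices=1).flatMap(load_config).collect())

If you go with option 1 instead, SparkFiles.get('<filename>') resolves the local path of each file that was listed individually in --files.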

Note that files passed through --files and --archives are available to Spark executors only. This behavior is consistent with spark-submit. If you need the files to be accessible by the Spark driver, consider using an init action to put the files somewhere on the local filesystem explicitly.
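
A sketch of that init-action approach, written in Python to match the rest of the post (a shell script works just as well); the script name, target directory, and staging location are assumptions. The script would be uploaded to GCS and referenced at cluster creation through the --initialization-actions flag.

#!/usr/bin/env python3
# Hypothetical Dataproc init action: runs on every node at cluster creation
# and stages the config bundle on the local filesystem so the driver can
# read the YAML files directly. Paths and bucket are assumptions.
import os
import subprocess
import zipfile

LOCAL_DIR = '/opt/app'
LOCAL_ZIP = os.path.join(LOCAL_DIR, 'app_code_v2.zip')

os.makedirs(LOCAL_DIR, exist_ok=True)
subprocess.run(
    ['gsutil', 'cp', 'gs://test-gcp-project-123/app_code_v2.zip', LOCAL_ZIP],
    check=True,
)
with zipfile.ZipFile(LOCAL_ZIP) as zf:
    zf.extractall(os.path.join(LOCAL_DIR, 'app_code_v2'))  # configs now on local disk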

Dagang
  • With --archives=, the zip file doesn't get extracted if we are passing it from a local directory. We should add --archives=<path>#<extract_path> to get it extracted to the working directory, but if we add #, gcloud expects a GCS path; with a local path I am unable to use #. – Tom J Muthirenthi May 11 '23 at 12:55