As part of a PySpark job on Google Cloud Dataproc, we have multiple files; one of them is a JSON config that is passed to the driver Python file. The driver file itself sits on Google Cloud Storage (the gs file system). We are trying to submit this job using the gcloud Dataproc API for Python.
The configuration used for the job object in submit_job is:
job_details = {
    'placement': {
        'cluster_name': cluster_name
    },
    'pyspark_job': {
        'main_python_file_uri': 'gs://driver_file.py',
        'python_file_uris': ['gs://package.whl'],
        'file_uris': ['gs://config.json'],
        'args': ['gs://config.json']
    }
}
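For context, the submission call itself looks roughly like the sketch below. This assumes the google-cloud-dataproc Python client (dataproc_v1, v2+ request style); project_id and region are placeholders for our actual values.

# Minimal sketch of the submission, assuming the google-cloud-dataproc
# client library (dataproc_v1). project_id and region are placeholders;
# job_details is the dict shown above.
from google.cloud import dataproc_v1

job_client = dataproc_v1.JobControllerClient(
    client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'}
)
result = job_client.submit_job(
    request={'project_id': project_id, 'region': region, 'job': job_details}
)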
My understanding from this is that config.json should be made available to the driver, which gcloud indeed does according to the logs:
Downloading gs://config.json to /tmp/tmprandomnumber/fop_gcp_1.json
The file_uris description on the gcloud documentation page suggests this is correct:
HCFS URIs of files to be copied to the working directory of Python drivers and distributed tasks. Useful for naively parallel tasks.
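Our reading of that description is that the staged copy should be reachable in the driver's working directory under its basename. The sketch below shows how we interpret that on the driver side; it is a simplified, hypothetical example rather than our actual driver code, and it assumes the gs:// URI passed in args is only used to recover the file name.

# Hypothetical driver-side read, assuming file_uris stages config.json
# into the driver's working directory. The gs:// URI passed via args is
# just a string, so we reduce it to the basename before opening.
import json
import os
import sys

config_uri = sys.argv[1]                    # 'gs://config.json'
local_name = os.path.basename(config_uri)   # 'config.json'
with open(local_name) as f:                 # expected to exist in the cwd
    config = json.load(f)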
Now, after a lot of debugging, we stumbled upon SparkFiles.get('config.json'), which is meant to return the path of a file uploaded to the driver, based on this question.
But this also fails with [Errno 2] No such file or directory: '/hadoop/spark/tmp/spark-random-number/userFiles-random-number/config.json'
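Concretely, the failing attempt looks roughly like the sketch below. SparkFiles.get only builds a path under the job's userFiles directory; the error is raised when we try to open that path.

# Sketch of the SparkFiles attempt that produces the Errno 2 above.
import json
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

config_path = SparkFiles.get('config.json')  # '/hadoop/spark/tmp/spark-.../userFiles-.../config.json'
with open(config_path) as f:                 # raises [Errno 2] No such file or directory
    config = json.load(f)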