
As part of a PySpark job on Cloud Dataproc we have multiple files; one of them is a JSON file that is passed to the driver Python file. The driver file itself sits on Google Cloud Storage (the gs:// file system).

We are trying to submit this job using the Dataproc API client for Python.

The configuration used for the job object in submit_job is:

job_details = {
    'placement': {
        'cluster_name': cluster_name
    },
    'pyspark_job': {
        'main_python_file_uri': 'gs://driver_file.py',
        'python_file_uris': ['gs://package.whl'],
        'file_uris': ['gs://config.json'],
        'args': ['gs://config.json']
    }
}
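
For reference, a minimal sketch of how a job object like this is typically passed to submit_job with the google-cloud-dataproc Python client; project_id, region and cluster_name here are placeholders, not values from the original setup:

from google.cloud import dataproc_v1

# Placeholder values; substitute your own project, region and cluster.
project_id = 'my-project'
region = 'us-central1'
cluster_name = 'my-cluster'

# The job controller endpoint is regional.
job_client = dataproc_v1.JobControllerClient(
    client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'}
)

job_details = {
    'placement': {
        'cluster_name': cluster_name
    },
    'pyspark_job': {
        'main_python_file_uri': 'gs://driver_file.py',
        'python_file_uris': ['gs://package.whl'],
        'file_uris': ['gs://config.json'],
        'args': ['gs://config.json']
    }
}

job = job_client.submit_job(
    request={'project_id': project_id, 'region': region, 'job': job_details}
)
print(f'Submitted job {job.reference.job_id}')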

My understanding from this is that config.json should be made available to the driver, which gcloud indeed does, judging by the logs: Downloading gs://config.json to /tmp/tmprandomnumber/fop_gcp_1.json

This seems correct per the file_uris description on the gcloud documentation page:

HCFS URIs of files to be copied to the working directory of Python drivers and distributed tasks. Useful for naively parallel tasks.

Now, after a lot of debugging, we stumbled upon SparkFiles.get('config.json'), which is meant to retrieve files that were uploaded to the driver, based on this question.

But this also fails with [Errno 2] No such file or directory: '/hadoop/spark/tmp/spark-random-number/userFiles-random-number/config.json'


1 Answer


Alright, figured it out. Posting it so that it can help someone out there!

Use SparkContext.addFile:

Add a file to be downloaded with this Spark job on every node.

sc.addFile(config_file_name)  # sc is the initialised SparkContext; addFile is an instance method

And then a simple

from pyspark import SparkFiles
local_path = SparkFiles.get(config_file_name)  # returns the local path of the downloaded file

Note: All this can only be done after the SparkContext is initialised in your code.
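
Putting it all together, here is a minimal end-to-end sketch of the driver side; the SparkSession setup and the use of sys.argv to receive the gs:// URI are illustrative assumptions, not taken verbatim from the original job:

import json
import os
import sys

from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Initialise Spark first; addFile/SparkFiles only work after this point.
spark = SparkSession.builder.appName('config-demo').getOrCreate()
sc = spark.sparkContext

config_uri = sys.argv[1]                         # e.g. 'gs://config.json', passed via args
config_file_name = os.path.basename(config_uri)  # 'config.json'

# Distribute the file to the driver and every executor node.
sc.addFile(config_uri)

# SparkFiles.get takes the bare file name and returns the local path.
with open(SparkFiles.get(config_file_name)) as f:
    config = json.load(f)

print(config)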
