
I'm a newbie on GCP and I'm struggling with submitting a PySpark job in Dataproc.

I have a Python script that depends on a config.yaml file, and I notice that when I submit the job everything is executed under /tmp/.

How can I make that config file available in the /tmp/ folder?

At the moment, I get this error:

12/22/2020 10:12:27 AM root         INFO     Read config file.
Traceback (most recent call last):
  File "/tmp/job-test4/train.py", line 252, in <module>
    run_training(args)
  File "/tmp/job-test4/train.py", line 205, in run_training
    with open(args.configfile, "r") as cf:
FileNotFoundError: [Errno 2] No such file or directory: 'gs://network-spark-migrate/model/demo-config.yml'

Thanks in advance

IlNardo92
  • Have a look at [this answer](https://stackoverflow.com/a/37685456/9671314). Try using the `--files` parameter (see the [doc](https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/pyspark#--files)). – cyxxy Dec 23 '20 at 21:20
  • Thanks @cyxxy. I figured it out thanks to your suggestion. – IlNardo92 Dec 24 '20 at 14:44

1 Answer


Below is a snippet that worked for me:

gcloud dataproc jobs submit pyspark gs://network-spark-migrate/model/train.py \
    --cluster=train-spark-demo \
    --region=europe-west6 \
    --files=gs://network-spark-migrate/model/demo-config.yml \
    -- --configfile ./demo-config.yml
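
The `--files` flag stages demo-config.yml into the job's working directory on the cluster, and everything after the standalone `--` is passed through as arguments to train.py. That is why the script can open the config through the local relative path ./demo-config.yml instead of the gs:// URI.

A minimal sketch of how train.py might consume the staged file (the argparse wiring and the use of PyYAML are assumptions based on the traceback above):

import argparse
import yaml  # assumes PyYAML is installed on the cluster

parser = argparse.ArgumentParser()
parser.add_argument("--configfile", required=True)  # flag name taken from the submit command
args = parser.parse_args()

# --files has already copied demo-config.yml into the working directory,
# so this relative path resolves to a local file, not a gs:// URI.
with open(args.configfile, "r") as cf:
    config = yaml.safe_load(cf)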
IlNardo92