
I am using a Google Dataproc cluster to run a Spark job; the script is written in Python.

When there is only one script (test.py, for example), I can submit the job with the following command:

gcloud dataproc jobs submit pyspark --cluster analyse ./test.py

But now test.py imports modules from other scripts I have written myself. How can I specify these dependencies in the command?


2 Answers


You could use the --py-files option mentioned here.
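For example, if test.py imports a helper module that sits next to it (mymodule.py is just an illustrative name here), the submission could look roughly like this; several dependencies can be passed as a comma-separated list of .py, .zip or .egg files:

gcloud dataproc jobs submit pyspark --cluster analyse --py-files=./mymodule.py ./test.py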


If you have a directory structure such as

maindir
├── lib
│   └── lib.py
└── run
    └── script.py

You could include additional files with the --files flag or the --py-files flag

gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --files lib/lib.py run/script.py

and then, in script.py, you can import it as

from lib import something
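
If the imported code also has to be available inside functions that run on the executors, --py-files is usually the more robust option, because Spark adds those files to the Python path on the workers as well as on the driver; the same submission would then look roughly like

gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --py-files=lib/lib.py run/script.py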

However, I am not aware of a way to avoid the tedious process of listing the files manually. Please check Submit a python project to dataproc job for a more detailed explanation.
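One pattern that is sometimes used to keep the file list short is to bundle the helper modules into a single archive and pass only that archive with --py-files; a minimal sketch, assuming the layout above (lib.zip is an illustrative name):

# run from maindir; zip the contents of lib/ so lib.py ends up at the root of the archive
(cd lib && zip -r ../lib.zip .)
# "from lib import something" keeps working, because lib.zip is added to the Python path
gcloud dataproc jobs submit pyspark --cluster=clustername --region=regionname --py-files=lib.zip run/script.py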
