4

I have the following structure in a Google Cloud Storage (GCS) bucket:

gs://my_bucket/py_scripts/
    wrapper.py
    mymodule.py
    _init__.py

I am running wrapper.py through Dataproc as a PySpark job. It imports mymodule with import mymodule at the start, but the job fails with an error saying there is no module named mymodule, even though both files are at the same path. The same import works fine in the Unix environment.

Note that _init__.py is empty. I also tested from mymodule import myfunc, but it returns the same error.

edited by Igor Dvorzhak · asked by Bajwa

1 Answer

2

Can you provide your PySpark job submit command? I suspect you are not passing the --py-files parameter to supply the additional Python files to the job. See https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/pyspark for reference. Dataproc will not automatically pick up other files in the same GCS bucket as inputs to the job.
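For example, with the standard submit form from that reference and the GCS paths from your question (the cluster name below is just a placeholder), something like this should make mymodule importable:

    gcloud dataproc jobs submit pyspark gs://my_bucket/py_scripts/wrapper.py \
        --cluster=my-cluster \
        --region=europe-west1 \
        --py-files=gs://my_bucket/py_scripts/mymodule.py

Files passed via --py-files are shipped with the job and placed on the PYTHONPATH of the driver and executors, so import mymodule then resolves.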

answered by Animesh
  • Thanks for your response. I am using this job submit command: gcloud beta dataproc workflow-templates add-job pyspark gs://mybucket/py_scripts/wrapper.py --step-id=01_python --workflow-template=wf_template --region europe-west1 -- (params). So how should I pass the other file, mymodule.py, in this command? And what about multiple dependencies, e.g. if mymodule.py imports another script mymodule2.py, and so on? – Bajwa Apr 30 '20 at 04:59
  • You can zip the files instead. Please check this Stack Overflow question, which discusses the same problem: https://stackoverflow.com/questions/61386462/submit-a-python-project-to-dataproc-job – Animesh Apr 30 '20 at 17:16
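A rough sketch of the zip approach from the last comment, applied to the workflow-template command above (file names are the ones mentioned in the question; this assumes workflow-templates add-job pyspark accepts the same --py-files flag as jobs submit pyspark):

    # bundle the dependency modules and upload the archive next to wrapper.py
    zip deps.zip mymodule.py mymodule2.py
    gsutil cp deps.zip gs://mybucket/py_scripts/deps.zip

    # add the job to the workflow template, pointing --py-files at the archive
    gcloud beta dataproc workflow-templates add-job pyspark gs://mybucket/py_scripts/wrapper.py \
        --step-id=01_python \
        --workflow-template=wf_template \
        --region europe-west1 \
        --py-files=gs://mybucket/py_scripts/deps.zip \
        -- (params)

Modules at the top level of the zip can then be imported directly (import mymodule); deps.zip is just an illustrative name.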