
I am currently launching PySpark jobs with the DataProcPySparkOperator from Airflow, using a script stored in Cloud Storage:

from airflow.contrib.operators import dataproc_operator

run_pyspark_job = dataproc_operator.DataProcPySparkOperator(
    task_id='run-dataproc-pyspark',
    main='gs://my-repo/my-script.py',  # main PySpark script, currently a Cloud Storage path
    project_id=PROJECT_ID,
    cluster_name=CLUSTER_NAME,
    region='europe-west4'
)

Is there any way to pass a script from Cloud Source Repositories? For a given repository one can get the absolute link to the script, but it does not seem to be accepted by the DAG.

https://source.cloud.google.com/my-organisation/my-repo/+/master:my-script.py

Is there any way to achieve it?


1 Answer


All Python and JAR files referenced by the job must be located on HDFS (or an HDFS-compatible file system) or in a Google Cloud Storage bucket. For more information, refer to the Airflow documentation for this operator.

To create a Cloud Storage bucket, you can use the following make-bucket (mb) command:

gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID-data

If you want to use files from Cloud Source Repositories, you first need to clone the repository and then copy its contents to the Google Cloud Storage bucket:

gsutil cp -r dir1/dir2 gs://$DEVSHELL_PROJECT_ID-data
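For completeness, here is a minimal sketch of that flow, assuming the repository name my-repo and the project my-organisation taken from the URL in your question (adjust the names to your setup):

# Clone the repository from Cloud Source Repositories (e.g. in Cloud Shell)
gcloud source repos clone my-repo --project=my-organisation

# Copy the script into the Cloud Storage bucket created above
gsutil cp my-repo/my-script.py gs://$DEVSHELL_PROJECT_ID-data/

The operator's main argument can then point at gs://$DEVSHELL_PROJECT_ID-data/my-script.py.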

I hope you find this information useful.
