I'm trying to execute a DAG on Cloud Composer that creates a Dataproc cluster and runs a PySpark job, but the job fails when it tries to save to BigQuery. I suspect a jar file is missing (--jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar), but I don't know how to add it to my code.
Code:

submit_job = DataprocSubmitJobOperator(
    task_id="pyspark_task",
    job=PYSPARK_JOB,
    location=REGION,
    project_id=PROJECT_ID,
)
If I submit the job to the cluster directly, it works:
gcloud dataproc jobs submit pyspark --cluster cluster-bc4b --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar --region us-central1 ~/examen/ETL/loadBQ.py
But I don't know how to replicate this in Airflow.
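My guess is that the jar has to be declared inside the PYSPARK_JOB dictionary that the operator receives, roughly like the sketch below (CLUSTER_NAME and PYSPARK_URI here are placeholders for my actual cluster name and the GCS path of loadBQ.py), but I'm not sure this is the right place:

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        # same jar I pass on the command line with --jars
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
    },
}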
The PySpark code:
df.write \
    .format("bigquery") \
    .mode("append") \
    .option("temporaryGcsBucket", "ds1-dataproc/temp") \
    .save("test-opi-330322.test.Base3")