
I'm trying to execute a DAG in Cloud Composer that creates a Dataproc cluster, but it fails when trying to save to BigQuery. I suspect a jar file is missing (--jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar), but I don't know how to add it to my code.

Code:

submit_job = DataprocSubmitJobOperator(
    task_id="pyspark_task",
    job=PYSPARK_JOB,
    location=REGION,
    project_id=PROJECT_ID,
)

If I submit this job directly to the cluster, it works:

gcloud dataproc jobs submit pyspark --cluster cluster-bc4b --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar --region us-central1 ~/examen/ETL/loadBQ.py

But I don't know how to replicate that in Airflow.

PySpark code (loadBQ.py):

df.write \
    .format("bigquery") \
    .mode("append") \
    .option("temporaryGcsBucket", "ds1-dataproc/temp") \
    .save("test-opi-330322.test.Base3")
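
For context, that write runs inside loadBQ.py. A minimal sketch of the script, where the SparkSession setup and the Parquet input path are assumptions (the real source of df is not shown here), looks like this:

from pyspark.sql import SparkSession

# The BigQuery connector jar must already be available to the job
# (via --jars on gcloud, or jar_file_uris when submitting from Airflow).
spark = SparkSession.builder.appName("loadBQ").getOrCreate()

# Hypothetical input path; the real df comes from the rest of the ETL.
df = spark.read.parquet("gs://ds1-dataproc/input/")

df.write \
    .format("bigquery") \
    .mode("append") \
    .option("temporaryGcsBucket", "ds1-dataproc/temp") \
    .save("test-opi-330322.test.Base3")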

1 Answer


In your example:

submit_job = DataprocSubmitJobOperator(
    task_id="pyspark_task",
    job=PYSPARK_JOB,
    location=REGION,
    project_id=PROJECT_ID,
)

the jars should be part of PYSPARK_JOB, like this:

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
    },
}

See the PySparkJob message in the Dataproc API reference for the full list of fields.
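
For completeness, here is a minimal DAG sketch wiring this together. The PYSPARK_URI path and the DAG settings are placeholders (not taken from your setup), and on newer versions of the Google provider the operator takes region instead of location:

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.utils.dates import days_ago

PROJECT_ID = "test-opi-330322"   # project from the BigQuery table in the question
REGION = "us-central1"
CLUSTER_NAME = "cluster-bc4b"
PYSPARK_URI = "gs://your-bucket/examen/ETL/loadBQ.py"  # placeholder GCS path

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        # Equivalent of --jars on the gcloud command line
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
    },
}

with DAG(
    dag_id="dataproc_pyspark_bq",
    start_date=days_ago(1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_job = DataprocSubmitJobOperator(
        task_id="pyspark_task",
        job=PYSPARK_JOB,
        location=REGION,   # use region=REGION on newer provider versions
        project_id=PROJECT_ID,
    )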

Dagang