
I am trying to execute a Spark jar on Dataproc using Airflow's DataProcSparkOperator. The jar is located on GCS, and I am creating the Dataproc cluster on the fly and then executing this jar on the newly created cluster.

I am able to execute this with Airflow's DataProcSparkOperator using the default settings, but I am not able to configure Spark job properties (e.g. --master, --deploy-mode, --driver-memory, etc.). The Airflow documentation didn't help, and the things I tried didn't work out either. Help is appreciated.


1 Answer


To configure a Spark job through DataProcSparkOperator you need to use the dataproc_spark_properties parameter.

For example, you can set deployMode like this:

DataProcSparkOperator(
    dataproc_spark_properties={ 'spark.submit.deployMode': 'cluster' })

In this answer you can find more details.
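For context, a minimal sketch of a full task definition might look like this (the task id, cluster name, region, jar path, property values, and dag object are placeholders, assuming the Airflow 1.x contrib operator):

from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

# Minimal sketch; cluster_name, region, jar path, and dag are placeholders.
submit_spark_job = DataProcSparkOperator(
    task_id='submit_spark_job',
    main_jar='gs://my-bucket/jars/my-spark-job.jar',  # jar already uploaded to GCS
    cluster_name='my-ephemeral-cluster',              # cluster created earlier in the DAG
    region='us-central1',
    dataproc_spark_properties={
        'spark.submit.deployMode': 'cluster',  # instead of --deploy-mode
        'spark.driver.memory': '4g',           # instead of --driver-memory
        'spark.executor.memory': '4g',
    },
    dag=dag,
)

Note that on Dataproc the Spark master is managed by the cluster (jobs run on YARN), so you generally don't need to set --master yourself.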
