
We have an Airflow DAG that runs a PySpark job on Dataproc. The job needs a JDBC driver, which I'd normally pass to the gcloud dataproc jobs submit command:

gcloud dataproc jobs submit pyspark \
  --cluster my-cluster \
  --properties spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
  --py-files ...

But how can I do it with Airflow's DataProcPySparkOperator?

For now we're adding this library to the cluster itself:

gcloud dataproc clusters create my-cluster \
  --region global \
  --zone europe-west1-d \
  ...
  --properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
  ...

This seems to be working fine, but it doesn't feel like the right way to do it. Is there another way?

Nira

1 Answer


I believe you want to pass dataproc_pyspark_properties to the DataProcPySparkOperator.

See: https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataproc_operator.py
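For example, wiring it into a DAG could look roughly like this (the DAG id, schedule and GCS script path below are placeholders, not taken from your setup):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

dag = DAG(
    dag_id='my_dataproc_dag',               # placeholder DAG id
    start_date=datetime(2017, 11, 1),
    schedule_interval=None,
)

run_job = DataProcPySparkOperator(
    task_id='run_pyspark_job',
    main='gs://my-bucket/jobs/my_job.py',   # placeholder path to the PySpark script
    cluster_name='my-cluster',
    # Equivalent of `--properties spark.jars.packages=...` on the gcloud command line:
    dataproc_pyspark_properties={
        'spark.jars.packages': 'mysql:mysql-connector-java:6.0.6',
    },
    dag=dag,
)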

tix
  • I saw this one, but the documentation says it's a "Map for the Pig properties", and I think I can only give jar URLs in the "jars" property (i.e. put the jars in storage myself) rather than use the `mysql:mysql-connector-java:6.0.6` format to resolve the dependency automatically. Or is that not the case? – Nira Nov 15 '17 at 07:56
  • Both points are correct. You can either put the file in GCS yourself, or use the Maven package path. – tix Nov 15 '17 at 22:35
  • Just tried with `dataproc_pyspark_properties={'pig.additional.jars':'mysql:mysql-connector-java:6.0.6'}`, and also `dataproc_pyspark_properties={'jar':'mysql:mysql-connector-java:6.0.6'}` and `dataproc_pyspark_jars=['mysql:mysql-connector-java:6.0.6']`. None worked: the first two result in a ClassNotFoundException and the third in a NullPointerException :-/ – Nira Nov 17 '17 at 10:08
  • It sounds like what you want is `spark.jars.packages` to provide a list of Maven coordinates: https://spark.apache.org/docs/latest/configuration.html – tix Nov 17 '17 at 18:18
  • But can I pass that as a pig property? Or in any of the other DataProcPySparkOperator parameters? – Nira Nov 19 '17 at 08:13
  • I am not sure what you're asking; you're referring to a PySpark job (which does not accept any Pig properties). If this is the case, you want to use `dataproc_pyspark_properties={'spark.jars.packages': 'mysql:mysql-connector-java:6.0.6'}` OR you can do `dataproc_pyspark_jars=['gs://my-bucket/..../mysql.jar']` (see the sketch after these comments). – tix Nov 19 '17 at 20:28
  • I'm talking about the Airflow API. The documentation says dataproc_pyspark_properties are Pig properties, so I don't think I can pass Spark properties in it... Or can I? – Nira Nov 20 '17 at 20:51
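
For reference, the alternative from the comments, pointing the operator at a jar already staged in GCS instead of a Maven coordinate, would be a sketch along these lines (the bucket paths are placeholders):

run_job = DataProcPySparkOperator(
    task_id='run_pyspark_job',
    main='gs://my-bucket/jobs/my_job.py',   # placeholder path to the PySpark script
    cluster_name='my-cluster',
    # Jar uploaded to GCS beforehand, passed directly instead of being resolved from Maven:
    dataproc_pyspark_jars=['gs://my-bucket/jars/mysql-connector-java-6.0.6.jar'],
    dag=dag,
)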