
I have encountered something called LivyBatchOperator, but I am unable to find a good example of using it to submit PySpark applications in Airflow. Any info on this would be appreciated. Thanks in advance.

kavya

1 Answer


I came across this blog post, which walks you through the available options for running Spark jobs from Airflow.

Here is an example of LivyBatchOperator, and here is how to install airflow-livy-operators.
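For orientation, here is a minimal sketch of a DAG using it. This assumes the airflow-livy-operators package (whose import path I believe is `airflow_livy.batch`) and a Livy endpoint configured in Airflow; file paths and names are placeholders:

```python
# pip install airflow-livy-operators
from datetime import datetime

from airflow import DAG
from airflow_livy.batch import LivyBatchOperator  # import path of airflow-livy-operators

with DAG(
    dag_id="pyspark_via_livy",
    start_date=datetime(2020, 7, 1),
    schedule_interval=None,
) as dag:
    # Submits the script as a Livy batch and polls the Livy REST API
    # until the batch reaches a terminal state.
    submit = LivyBatchOperator(
        task_id="submit_word_count",
        name="word_count",
        file="/jobs/word_count.py",          # entry-point script visible to the cluster
        arguments=["/data/in", "/data/out"],
    )
```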

I would recommend the following options:

  1. AWS EMR: use EmrAddStepsOperator (see the sketch after this list).
  2. Regular Spark cluster: use the mechanism above to set up the Livy operators in Airflow, as sketched earlier. This keeps the configuration lean from the Airflow server's perspective while Livy fronts the Spark cluster.
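For option 1, a hedged sketch of adding a PySpark step to an already-running EMR cluster; the job-flow ID, S3 paths, and connection ID are placeholders, and the import path is the Airflow 1.10 contrib one:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator

# One EMR step that runs spark-submit through command-runner.jar.
SPARK_STEPS = [
    {
        "Name": "pyspark_job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/word_count.py"],
        },
    }
]

with DAG(
    dag_id="pyspark_via_emr",
    start_date=datetime(2020, 7, 1),
    schedule_interval=None,
) as dag:
    add_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id="j-XXXXXXXXXXXXX",   # ID of a running EMR cluster
        aws_conn_id="aws_default",
        steps=SPARK_STEPS,
    )
```

In practice you would usually pair this with an EmrStepSensor so the DAG waits for the step to finish.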

Let me know how it goes!

Abdul
  • Thanks, the linked posts helped me get started. Can we pass a ZIP file in the **file** parameter and a **class_name** when submitting PySpark applications through Livy? – kavya Jul 01 '20 at 17:43
  • Yes, there is an option to pass ZIP files, but through the **py_files** argument, not **file**. In the Livy batch API, `pyFiles` takes a list of Python files/ZIPs to add to the PYTHONPATH, `file` is the entry point (for Python, the driver script), and `class_name` is the main class for Java/Scala jobs only. Refer here for the Livy REST API documentation, which is the backbone of this LivyBatchOperator: https://livy.incubator.apache.org/docs/latest/rest-api.html – Abdul Jul 01 '20 at 20:02
  • I am getting an error when I try `LivyBatchOperator(task_id='spark_job', file='/abc/xyz.zip', class_name='src.foo.py', conf={'spark.submit.pyFiles': '/abc/lmn.zip'})`, where src.foo.py is a file inside xyz.zip: `Error: --py-files given but primary resource is not a Python script`. @Abdul – kavya Jul 02 '20 at 12:01
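That error means spark-submit's primary resource (the Livy `file` field) must be a plain .py script whenever --py-files is used; the ZIPs belong in `py_files` instead. A minimal sketch of the corrected call, assuming the entry-point script (here a hypothetical /abc/foo.py) is shipped outside the ZIP:

```python
# file must be a plain Python entry-point script, not a ZIP;
# ZIP dependencies ride along via py_files (Livy's pyFiles field).
submit = LivyBatchOperator(
    task_id="spark_job",
    file="/abc/foo.py",                         # hypothetical driver script
    py_files=["/abc/xyz.zip", "/abc/lmn.zip"],  # ZIPs added to the PYTHONPATH
)
# class_name only applies to Java/Scala jobs, so it is omitted for PySpark.
```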