There are two operators we can use to connect Livy and Airflow:
- LivyBatchOperator
- LivyOperator
In the following example, I will cover the LivyOperator.
LivyOperator
Step 1: Update the Livy connection:
Log in to the Airflow UI --> click on the Admin tab --> Connections --> search for livy. Click the edit button and update the Host and Port parameters.
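The connection can also be created from the command line instead of the UI. This is a minimal sketch using the Airflow 2.x CLI, assuming the Livy server runs on 192.168.0.1 port 8998 and the default connection id livy_default used by LivyOperator:
airflow connections add 'livy_default' \
    --conn-type 'livy' \
    --conn-host '192.168.0.1' \
    --conn-port '8998'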
Step 2: Install the apache-airflow-providers-apache-livy package:
pip install apache-airflow-providers-apache-livy
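To confirm that the provider was installed and registered with Airflow, you can list the available providers (Airflow 2.x CLI):
airflow providers list | grep livy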
Step 3: Create the DAG file under the $AIRFLOW_HOME/dags directory.
vi $AIRFLOW_HOME/dags/livy_operator_sparkpi_dag.py
from datetime import timedelta, datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.providers.apache.livy.operators.livy import LivyOperator
default_args = {
    'owner': 'RangaReddy',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

# Initiate DAG
livy_operator_sparkpi_dag = DAG(
    dag_id="livy_operator_sparkpi_dag",
    default_args=default_args,
    schedule_interval='@once',
    start_date=datetime(2022, 3, 2),
    tags=['example', 'spark', 'livy']
)
# Define the Livy task with LivyOperator
livy_sparkpi_submit_task = LivyOperator(
    file="/root/spark-3.2.1-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.1.jar",
    class_name="org.apache.spark.examples.SparkPi",
    driver_memory="1g",
    driver_cores=1,
    executor_memory="1g",
    executor_cores=2,
    num_executors=1,
    name="LivyOperator SparkPi",
    task_id="livy_sparkpi_submit_task",
    dag=livy_operator_sparkpi_dag,
)
# Begin and end markers, attached to the same DAG
begin_task = DummyOperator(task_id="begin_task", dag=livy_operator_sparkpi_dag)
end_task = DummyOperator(task_id="end_task", dag=livy_operator_sparkpi_dag)

begin_task >> livy_sparkpi_submit_task >> end_task
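Once the DAG file is saved, Airflow picks it up from the dags folder and you can run it from the UI or from the command line. A minimal sketch with the Airflow 2.x CLI (the DAG must be unpaused before a triggered run is scheduled):
airflow dags unpause livy_operator_sparkpi_dag
airflow dags trigger livy_operator_sparkpi_dag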
Once the job has finished, check the SparkPi result in the Livy batch log. Batch id 0 here refers to the first batch submitted to this Livy server:
LIVY_HOST=192.168.0.1
curl http://${LIVY_HOST}:8998/batches/0/log | python3 -m json.tool
Output:
"Pi is roughly 3.14144103141441"