
I am trying to schedule a job in EMR using the Airflow Livy operator. Here is the example code I followed. The issue is that the Livy connection string (host name & port) is specified nowhere. How do I provide the Livy server host name & port to the operator?

Also, the operator has a parameter livy_conn_id, which in the example is set to a value of livy_conn_default. Is that the right value?... or do I have to set some other value?

Raj

2 Answers


You should have livy_conn_default under Connections in the Admin tab of your Airflow dashboard. If that's set up correctly, then yes, you can use it. Otherwise, you can edit that connection or create another connection id and pass it as livy_conn_id.
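For example, a minimal sketch of wiring a connection id into the operator (the jar path and class name below are placeholders, not from the original example):

from airflow.providers.apache.livy.operators.livy import LivyOperator

livy_task = LivyOperator(
    task_id="submit_spark_job",
    livy_conn_id="livy_conn_default",  # must match a connection id defined under Admin -> Connections
    file="s3://my-bucket/jars/my-app.jar",  # placeholder: path to your application jar
    class_name="com.example.MyApp",         # placeholder: your main class
)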

A.B

There are two operators we can use to connect Airflow to Livy:

  1. Using LivyBatchOperator
  2. Using LivyOperator

In the following example, I will cover the LivyOperator API.

LivyOperator

Step1: Update the Livy connection configuration:

Log in to the Airflow UI --> click on the Admin tab --> Connections --> search for livy. Click the edit button and update the Host and Port parameters.
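If you prefer not to use the UI, the same connection can be created with the Airflow CLI (a sketch; the host below is a placeholder for your Livy server):

airflow connections add 'livy_default' \
    --conn-type 'livy' \
    --conn-host 'ip-10-0-0-10.ec2.internal' \
    --conn-port 8998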

Step2: Install the apache-airflow-providers-apache-livy package:

pip install apache-airflow-providers-apache-livy

Step3: Create the DAG file under the $AIRFLOW_HOME/dags directory.

vi $AIRFLOW_HOME/dags/livy_operator_sparkpi_dag.py

from datetime import timedelta, datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.providers.apache.livy.operators.livy import LivyOperator

default_args = {
    'owner': 'RangaReddy',
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

# Initiate DAG
livy_operator_sparkpi_dag = DAG(
    dag_id = "livy_operator_sparkpi_dag",
    default_args=default_args,
    schedule_interval='@once',
    start_date = datetime(2022, 3, 2),
    tags=['example', 'spark', 'livy']
)

# define livy task with LivyOperator
livy_sparkpi_submit_task = LivyOperator(
    file="/root/spark-3.2.1-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.1.jar",
    class_name="org.apache.spark.examples.SparkPi",
    driver_memory="1g",
    driver_cores=1,
    executor_memory="1g",
    executor_cores=2,
    num_executors=1,
    name="LivyOperator SparkPi",
    task_id="livy_sparkpi_submit_task",
    livy_conn_id="livy_default",  # the operator's default; points at the connection updated in Step1
    dag=livy_operator_sparkpi_dag,
)

begin_task = DummyOperator(task_id="begin_task", dag=livy_operator_sparkpi_dag)
end_task = DummyOperator(task_id="end_task", dag=livy_operator_sparkpi_dag)

begin_task >> livy_sparkpi_submit_task >> end_task
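
Once the scheduler has picked up the DAG file, unpause and trigger it (a sketch using the standard Airflow CLI):

airflow dags unpause livy_operator_sparkpi_dag
airflow dags trigger livy_operator_sparkpi_dag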
Step4: Verify the job output by fetching the batch logs from the Livy REST API (batch id 0 here refers to the first submitted batch):

LIVY_HOST=192.168.0.1
curl http://${LIVY_HOST}:8998/batches/0/log | python3 -m json.tool

Output:

"Pi is roughly 3.14144103141441"
Ranga Reddy