0

I am new to airflow so apoliges if this has been asked somewhere.

I have a query i run in hive that is partitioned on year month so e.g. 202001.

how can i run a query which specifies a variable for different values within the query in airflow? eg. taking this example

from airflow import DAG
from airflow.operators.mysql_operator import MySqlOperator

default_arg = {'owner': 'airflow', 'start_date': '2020-02-28'}

dag = DAG('simple-mysql-dag',
          default_args=default_arg,
          schedule_interval='00 11 2 * *')

mysql_task = MySqlOperator(dag=dag,
                           mysql_conn_id='mysql_default', 
                           task_id='mysql_task'
                           sql='<path>/sample_sql.sql',
                       params={'test_user_id': -99})

where my sample_sql.hql looks like:

ALTER TABLE sample_df DROP IF EXISTS
PARTITION (
    cpd_ym = ${ym}
) PURGE;

INSERT INTO sample_df
PARTITION (
    cpd_ym = ${ym}
)
SELECT
  * 

from sourcedf 

; 

ANALYZE TABLE sample_df
PARTITION (
    cpd_ym = ${ym}
)
COMPUTE STATISTICS;

ANALYZE TABLE sample_df
PARTITION (
    cpd_ym = ${ym}
)
COMPUTE STATISTICS FOR COLUMNS;

i want to run the above for different values of ym using airflow e.g. between 202001 and 202110 how can i do this?

Maths12
  • 852
  • 3
  • 17
  • 31

1 Answers1

1

I'm a bit confused because you are asking about Hive yet you show example of MySqlOperator. In any case assuming the the sql/hql parameter is templated you can use execution_date directly in your query. Thus you can extract the year & month to be used for the partition value.

Example:

mysql_task = MySqlOperator(
    dag=dag,
    task_id='mysql_task',
    sql="""SELECT {{ execution_date.strftime('%y%m') }}""",
)

enter image description here

So in your sample_sql.hql it will be:

ALTER TABLE sample_df DROP IF EXISTS
PARTITION (
    cpd_ym = {{ execution_date.strftime('%y%m') }}
) PURGE;

You mentioned that you are new to Airflow so make sure you are aware what execution_date is and how it's being calculated (if you are not check this answer). You can do string manipulations to other macros as well. Choose the macro that is suitable to your needs (execution_date / prev_execution_date / next_execution_date / etc...).

Elad Kalif
  • 14,110
  • 2
  • 17
  • 49
  • so would i need to do somehing like : for date in range(20201,202105) : mysql_task = MySqlOperator( dag=dag, task_id='mysql_task', sql="""SELECT {{ date }}""", ) ?? – Maths12 Aug 23 '21 at 10:25
  • Why? if you create a daily runs then each run process a single day. If you want to run for 2 days no need for a loop - just execute two dagruns. – Elad Kalif Aug 23 '21 at 11:58