0

What I want to do :
- compare missing ID from MySQL and BigQuery
- get all data from MySQL where id in missing ID

table = 'orders'

def get_missing_ids():
    aws_id = get_aws_id(table)
    bq_id = get_bq_id(table)
    missing_id = [np.setdiff1d(aws_id,bq_id)]
    missing_ids = ', '.join(map(str,missing_id))
    return missing_ids

missing_ids = get_missing_ids()

get_missing_data = MysqlToGCS(
    task_id = 'get_orders_from_aws',
    sql = """select *
        from orders 
        where id in ({{params.missing_ids}})""",
    params = {'missing_ids':missing_ids},
    bucket = 'airflow_bucket',
    filename = 'data/orders/db-orders{{ds}}{}',
    mysql_conn_id = 'aws_readreplica',
    approx_max_file_size_bytes = 100000000,
    google_cloud_storage_conn_id = 'google_cloud_storage_default',
    dag=dag)

def print_done():
    print("done boiiiii")
    time.sleep(60)

task = PythonOperator(
        task_id='done',
        python_callable=print_done,
        dag=dag)

task.set_upstream(get_missing_data)

I read about Xcom, but I don't understand how to implement it here.

Xenon Drey
  • 171
  • 2
  • 14

2 Answers2

1

Recently, I am working to airflow topic and dealing with different database. so I think I should be able to share some experience.

  1. Airflow useful concept: DAG/Tasks: You can view & track in the airflow admin web->dag page.

    variable: set & get global parameter among different dags in airflow system level Xcome : set & get parameter amongh different tasks of certain dag level. Python Operator: it can be task instances. DB operator/model: it be can task instances or objects insides python functions.

  2. In my case, I only use python operator and db related Operator was used insides of python operator related functions.

3. in you case, you might do it in following pseudocode:

from airflow import DAG

from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator  as mysqltogcs

from datetime import timedelta

table = 'orders'

==============

def get_missing_ids(ds, **kwargs):
    ti = kwargs['ti']
    aws_id = get_aws_id(table)
    bq_id = get_bq_id(table)
    missing_id = [np.setdiff1d(aws_id,bq_id)]
    missing_ids = ', '.join(map(str,missing_id))
    ti.xcom_push(key='missing_ids', value=missing_ids)

===============

def get_orders_from_aws(ds, **kwargs):
    missing_ids = ti.xcom_pull(key='missing_ids', task_ids='get_missing_ids')
    sql = f"select *  from orders where id in ({missing_ids})"
    MG = mysqltogcs(sql=sql,
                    bucket = 'airflow_bucket',
                    filename = 'data/orders/db-orders{{ds}}{}',
                    mysql_conn_id = 'aws_readreplica',
                    approx_max_file_size_bytes = 100000000,
                    google_cloud_storage_conn_id = 'google_cloud_storage_default'
                   )
    missing_data = MG.execute()




def print_done():
    print("done boiiiii")
    time.sleep(60)

==============

with DAG(dag_id="your_name", schedule_interval='timedelta(minute=5)') as dag:

        task_1 = PythonOperator( task_id ="get_missing_ids",
                        python_callable=get_missing_ids,
                        provide_context=True)

        task_2 = PythonOperator( task_id = 'get_orders_from_aws',
                        python_callable=get_orders_from_aws,
                        provide_context=True)

        task_3 = PythonOperator( task_id='done',
                        python_callable=print_done)

        task_1 >> task_3 >>task_3
Yong Wang
  • 1,200
  • 10
  • 15
0

I think the easiest way is to create a function create_sql_query which can look like this:

def create_sql_query():
    missing_ids = get_missing_ids()
    return f"select * from orders where id in ({missing_ids})"

Then you can change your get_missing_data to:

get_missing_data = MysqlToGCS(
    task_id = 'get_orders_from_aws',
    sql = create_sql_query(),
    ...

Actually you can even write like this:

get_missing_data = MysqlToGCS(
    task_id = 'get_orders_from_aws',
    sql = f"select * from orders where id in ({get_missing_ids()})",
    ...

but it doesn't look pretty as for me.

amoskaliov
  • 739
  • 4
  • 10