
I am using a BashOperator to execute ETL scripts on a GCP Compute Engine instance, and some files can take more than 10 hours to complete.

Since the script runs on Compute Engine, how can I set the BashOperator task (and the DAG run) to success so that I can start a new DAG that executes another ETL script on a different Compute Engine instance?

(I can run multiple DAGs in parallel, but I will have 20-30 ETL scripts running on different Compute Engine instances, so I need to mark the BashOperator as success once execution on the Compute Engine instance has started.)

import time
from datetime import datetime, timezone

from airflow.models import TaskInstance
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils.state import State

t1 = PythonOperator(
    task_id='check_running_file',
    python_callable=check_running_file,
    dag=dag,
)
t2 = BashOperator(
    task_id='start_vm',
    bash_command="gcloud compute instances start vm-1 --zone=zone",
    dag=dag,
)
bash_task = BashOperator(
    task_id='script_execution',
    bash_command='gcloud compute ssh --project ' + PROJECT_ID + ' --zone ' + ZONE + ' ' + GCE_INSTANCE + ' --command ' + command,
    dag=dag,
)

def set_task_status(**context):
    # try to mark the long-running bash task as success (this is where the error is raised)
    utc_now = datetime.utcnow().replace(tzinfo=timezone.utc)
    task_instance = TaskInstance(task_id='bash_task', dag_id='process_folders', execution_date=utc_now)
    task_instance.set_state(state=State.SUCCESS)

set_task_instance = PythonOperator(
    task_id='set_status',
    python_callable=set_task_status,
    provide_context=True,
    dag=dag,
)
# wait 5 minutes before trying to mark bash_task as success
set_task_instance.pre_execute = lambda **x: time.sleep(300)

t1 >> t2 >> [bash_task, set_task_instance]

How do I set bash_task to success after 5 minutes? I tried set_state, but set_task_instance throws an error because it expects a dag rather than a task.
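
Alternatively, would it be simpler to make the ssh command itself return right away by detaching the remote script with nohup, so that bash_task succeeds as soon as the script is launched on the VM? A rough sketch of what I mean (the script path and log file are placeholders):

# placeholder script and log paths; detach the remote process so gcloud ssh returns immediately
detached_command = "nohup bash /path/to/etl_script.sh > /tmp/etl_script.log 2>&1 < /dev/null &"

bash_task = BashOperator(
    task_id='script_execution',
    bash_command=(
        'gcloud compute ssh --project ' + PROJECT_ID +
        ' --zone ' + ZONE + ' ' + GCE_INSTANCE +
        " --command '" + detached_command + "'"
    ),
    dag=dag,
)

With the process detached, the gcloud ssh call should come back once the script has started, so the task would finish while the ETL keeps running on the instance.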


1 Answer


You can use the ti object available in the context of the python_callable set_task_status to look up the task instance of bash_task in the current DAG run. You can then use its set_state method to mark that task as success.

from airflow.utils.state import State

def set_task_status(**context):
    ti = context['ti']
    # look up the TaskInstance of the long-running bash task in the current DAG run
    bash_ti = ti.get_dagrun().get_task_instance('script_execution')
    bash_ti.set_state(State.SUCCESS)

set_task_instance = PythonOperator(
    task_id='set_status',
    python_callable=set_task_status,
    provide_context=True,
    dag=dag,
)

bash_task >> set_task_instance
  • But in this case set_task_instance will run only after bash_task is successful, and my bash_task takes 10 hours to complete. How can I mark it as success a few minutes after bash_task has started, so that the script keeps executing on the Compute Engine instance while bash_task is marked as success? – avinash reddy May 10 '23 at 16:57