
I have a DAG which is created by querying DynamoDB for a list, and for each item in the list a task is created using a PythonOperator and added to the DAG. Not shown in the example below, but it's important to note that some of the items in the list depend on other tasks, so I'm using set_upstream to enforce the dependencies.

- airflow_home
  \- dags
    \- workflow.py

workflow.py

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def get_task_list():
    # ... query dynamodb ...

def run_task(task):
    # ... do stuff ...

dag = DAG(dag_id='my_dag', ...)
tasks = get_task_list()
for task in tasks:
    t = PythonOperator(
        task_id=task['id'],
        provide_context=False,
        dag=dag,
        python_callable=run_task,
        op_args=[task]
    )
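The dependency wiring isn't shown above, but it looks roughly like the following sketch (the 'parents' field here is illustrative, standing in for however an item records which other items it depends on):

# Illustrative sketch of the dependency wiring; 'parents' is a hypothetical
# field name, not the real item schema.
operators = {t.task_id: t for t in dag.tasks}
for task in tasks:
    for parent_id in task.get('parents', []):
        # make this task run only after each of its parents
        operators[task['id']].set_upstream(operators[parent_id])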

The problem is that workflow.py is being run over and over (every time a task runs?), and my get_task_list() method is getting throttled by AWS and throwing exceptions.

I thought it was because whenever run_task() was called, all the module-level code in workflow.py was being executed again, so I tried moving run_task() into a separate module, like this:

- airflow_home
  \- dags
    \- workflow.py
    \- mypackage
      \- __init__.py
      \- task.py
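workflow.py then just imports the callable instead of defining it, roughly:

# workflow.py -- run_task now lives in mypackage/task.py
from mypackage.task import run_task

# ... the DAG definition and the PythonOperator loop are unchanged, except
# that python_callable=run_task now refers to the imported function.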

But it didn't change anything. I've even tried putting get_task_list() into a SubDagOperator wrapped with a factory function, which still behaves the same way.
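The SubDagOperator attempt looked roughly like this (names are illustrative, and default_args is assumed to be defined alongside the DAG); the factory still calls get_task_list() whenever the file is parsed:

# Rough sketch of the SubDagOperator attempt.
from airflow.operators.subdag_operator import SubDagOperator

def build_task_list_subdag(parent_dag_id, child_dag_id, args):
    subdag = DAG(dag_id='%s.%s' % (parent_dag_id, child_dag_id), default_args=args)
    # get_task_list() still runs every time the scheduler parses this file
    for task in get_task_list():
        PythonOperator(
            task_id=task['id'],
            dag=subdag,
            python_callable=run_task,
            op_args=[task]
        )
    return subdag

process_items = SubDagOperator(
    task_id='process_items',
    subdag=build_task_list_subdag('my_dag', 'process_items', default_args),
    dag=dag
)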

Is my problem related to these issues?

Also, why is workflow.py getting run so often, and why would an error thrown by get_task_list() cause an individual task to fail, when the task method doesn't reference workflow.py and has no dependencies on it?

Most importantly, what would be the best way to both process the list in parallel and enforce any dependencies between items in the list?

Mark J Miller

1 Answer


As per the questions you referenced, Airflow doesn't support task creation while a DAG is running.

What happens instead is that Airflow periodically generates the complete DAG definition before it starts a run. Ideally, the period of that regeneration would match the schedule interval of the DAG.

However, every time Airflow checks the DAG file for changes it also regenerates the complete DAG, and that is what causes the flood of requests. How often this happens is controlled by the min_file_process_interval and dag_dir_list_interval settings in airflow.cfg.
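For example, in airflow.cfg (the values below are just illustrative, in seconds):

[scheduler]
# how long the scheduler waits before re-parsing the same DAG file
min_file_process_interval = 30
# how often the dags/ folder is scanned for new files
dag_dir_list_interval = 300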

Regarding the failing tasks: they fail because the DAG creation itself failed, so Airflow was never able to start them.

Him
    Setting `min_file_process_interval` to 30 slowed the calls to `get_task_list()` down to one every 30 seconds and I stopped getting throttled. As for dynamic task creation, I'm going to try to create a dag which will build another dag and save it to `globals()[dag_id]` as mentioned in the [FAQ](http://airflow.readthedocs.io/en/latest/faq.html?highlight=dynamic#how-can-i-create-dags-dynamically) – Mark J Miller Jul 15 '17 at 19:30
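The FAQ pattern being referred to looks roughly like this (a minimal sketch, not copied verbatim from the FAQ; the DAG arguments are elided and the dag_id naming is illustrative):

def create_dag(dag_id):
    # build the whole DAG inside a factory function
    dag = DAG(dag_id=dag_id, ...)
    # ... add operators to dag ...
    return dag

# register each generated DAG in the module's globals() so the
# scheduler can discover it by name
for i in range(10):
    dag_id = 'my_generated_dag_%d' % i
    globals()[dag_id] = create_dag(dag_id)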