I have a DAG that is created by querying DynamoDB for a list; for each item in the list, a task is created using a PythonOperator and added to the DAG. It's not shown in the example below, but it's important to note that some of the items in the list depend on other tasks, so I'm using set_upstream to enforce those dependencies.
- airflow_home
  \- dags
    \- workflow.py
workflow.py
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def get_task_list():
    # ... query dynamodb ...


def run_task(task):
    # ... do stuff ...


dag = DAG(dag_id='my_dag', ...)

tasks = get_task_list()

for task in tasks:
    t = PythonOperator(
        task_id=task['id'],
        provide_context=False,
        dag=dag,
        python_callable=run_task,
        op_args=[task]
    )
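To give an idea of the set_upstream wiring mentioned above, it's along these lines (simplified sketch; the depends_on field is illustrative, not the real item schema):

# look up the operators created in the loop above and wire parents to children
operators = {t.task_id: t for t in dag.tasks}
for task in tasks:
    for parent_id in task.get('depends_on', []):
        operators[task['id']].set_upstream(operators[parent_id])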
The problem is that workflow.py is getting run over and over (every time a task runs?), and my get_task_list() call is getting throttled by AWS and throwing exceptions. I thought it was because whenever run_task() was called it was executing all the globals in workflow.py, so I tried moving run_task() into a separate module, like this:
- airflow_home
  \- dags
    \- workflow.py
    \- mypackage
      \- __init__
      \- task.py
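Roughly, the split looks like this (simplified sketch; only the import path changes):

task.py
def run_task(task):
    # ... do stuff ...

workflow.py
from mypackage.task import run_task
# get_task_list(), the DAG object, and the PythonOperator loop stay the same as above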
But it didn't change anything. I've even tried putting get_task_list() into a SubDagOperator wrapped with a factory function, which still behaves the same way.
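For reference, that attempt was roughly along these lines (a minimal sketch; the factory signature, the dag_id naming, and default_args are illustrative):

from airflow.operators.subdag_operator import SubDagOperator

def subdag_factory(parent_dag_id, child_dag_id, default_args):
    # build a child DAG whose dag_id is '<parent>.<child>', as SubDagOperator expects
    subdag = DAG(dag_id='%s.%s' % (parent_dag_id, child_dag_id), default_args=default_args)
    for item in get_task_list():
        PythonOperator(
            task_id=item['id'],
            dag=subdag,
            python_callable=run_task,
            op_args=[item]
        )
    return subdag

process_list = SubDagOperator(
    task_id='process_list',
    subdag=subdag_factory('my_dag', 'process_list', default_args),  # default_args assumed to match the parent DAG
    dag=dag
)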
Is my problem related to these issues?
Also, why is workflow.py getting run so often, and why would an error thrown by get_task_list() cause an individual task to fail when the task method doesn't reference workflow.py and has no dependencies on it?
Most importantly, what would be the best way to both process the list in parallel and enforce any dependencies between items in the list?