
We have a huge DAG, with many small, fast tasks and a few big, time-consuming tasks.

We want to run just a part of the DAG, and the easiest way we found is to not add the tasks that we don't want to run. The problem is that our DAG has many co-dependencies, so it became a real challenge not to break the DAG when we want to skip some tasks.

Is there a way to add a status to a task by default (for every run)? Something like:

# get the skip list from an env variable
task_list = models.Variable.get('list_of_tasks_to_skip')

dag.skip(task_list)

or

for task in task_list:
    task.status = 'success'
Pablo
  • If you're marking all the tasks you don't care about as success without executing them on every run, what is keeping you from just removing those tasks from the DAG altogether? – Ben Gregory Jun 19 '18 at 20:39
  • @BenGregory we run the DAG often as a "full run", but sometimes a few tasks fail and we want to re-run just some sections of the DAG. The co-dependency tree is huge and simply deleting a task will often break the DAG, so we want to mark the whole DAG as "success" except the tasks that are needed and their dependencies. – Pablo Jun 19 '18 at 20:49
  • Ok - I'm not sure of a way to set a task to a certain state by default before it has been evaluated. Broadly, if you're running a large DAG that only needs to run certain parts most of the time, I would suggest using the BranchPythonOperator (https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/example_branch_operator.py) at various points to determine which downstream tasks to execute and which to skip. – Ben Gregory Jun 19 '18 at 21:33
  • **@Pablo** could you solve this problem? Another approach that I could think of is to move all parts that need to be re-run out of the main DAG and model them as separate top-level DAGs themselves. Now in your main DAG, you can link them up using `ShortCircuitOperator` or `BranchPythonOperator` coupled with a `TriggerDagRunOperator`. The problem that persists with this approach is that if you still have more parts that are supposed to run after these small top-level DAGs, then you'll need something like `ExternalTaskSensor` to await completion of these small DAGs before triggering them. Untidy. – y2k-shubham Sep 06 '18 at 05:14
  • @y2k-shubham yes, we used a workaround that is a bit complex, but useful for our problem. As you can see in the main question, we were looking for a way to modify the DAG dynamically using an `env-var`. We didn't find a way to skip tasks in Airflow, but we realized that it is possible to create the DAG based on an `env-var`. All our tasks were basically the same, so we create them in a loop from a list of tasks saved in an env-var; then, when we want to skip some, we modify that var with a graph algorithm (a minimal sketch of this pattern appears after these comments). Hope it helps. – Pablo Sep 25 '18 at 16:21
  • Consider adding an answer describing your solution / workaround – y2k-shubham Sep 26 '18 at 05:26
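
A minimal sketch of the pattern Pablo describes, assuming the task list is kept in a JSON Airflow Variable rather than a shell environment variable and that all tasks are interchangeable (the Variable name, DAG id, task names, and command are hypothetical; Airflow 2 import paths):

import json
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# Hypothetical Variable, e.g. '["extract", "transform", "load"]'.
# Editing it changes the DAG shape on the next parse, with no code deploy.
task_names = json.loads(Variable.get("tasks_to_run", default_var="[]"))

with DAG("variable_driven_dag",
         start_date=datetime(2018, 6, 1),
         schedule_interval=None) as dag:
    previous = None
    for name in task_names:
        task = BashOperator(task_id=name, bash_command=f"echo running {name}")
        # Chain the generated tasks linearly; Pablo's "graph algorithm"
        # would instead rebuild the real dependency edges here.
        if previous is not None:
            previous >> task
        previous = task

The usual caveat with this pattern is that `Variable.get` runs at parse time, so it queries the metadata database every time the scheduler parses the DAG file.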

2 Answers


As mentioned in the comments, you should use the BranchPythonOperator (or ShortCircuitOperator) to prevent the time-consuming tasks from executing. If you need downstream operators of these time-consuming tasks to run, you can use TriggerRule.ALL_DONE to have those operators run, but note that they will then run even when the upstream operators fail.

You can use Airflow Variables to affect these BranchPythonOperators without having to update the DAG, e.g.:

from airflow.models import Variable

def branch_python_operator_callable():
    # Return the task_id of the branch to follow, read from an Airflow
    # Variable so it can be changed without redeploying the DAG.
    return Variable.get('time_consuming_operator_var')

and use branch_python_operator_callable as the Python callable for your BranchPythonOperator.
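
A hedged sketch of how this could be wired together (the DAG id, task names, and Variable value are assumptions, not from the answer; Airflow 2 import paths):

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.utils.trigger_rule import TriggerRule

def branch_python_operator_callable():
    # The Variable holds the task_id to follow, e.g. "time_consuming_task"
    # or "skip_time_consuming".
    return Variable.get("time_consuming_operator_var")

with DAG("branch_example",
         start_date=datetime(2018, 6, 1),
         schedule_interval=None) as dag:
    branch = BranchPythonOperator(
        task_id="branch",
        python_callable=branch_python_operator_callable,
    )
    time_consuming = DummyOperator(task_id="time_consuming_task")
    skip = DummyOperator(task_id="skip_time_consuming")
    # ALL_DONE lets this run once both branches have finished or been
    # skipped, even if an upstream task failed (the caveat noted above).
    downstream = DummyOperator(task_id="downstream",
                               trigger_rule=TriggerRule.ALL_DONE)

    branch >> [time_consuming, skip]
    [time_consuming, skip] >> downstream

Flipping the Variable between the two task_ids switches which branch runs, with no change to the DAG file.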

Trevor Edwards
  • Having a `BranchPythonOperator` or `ShortCircuitOperator` switch for every task (or for every combination of DAG execution) is something we want to avoid if possible, same as having a DAG for every type of execution (considering the task co-dependencies). Is there a way to keep the tasks in the DAG but skip them, or mark them as successful just by giving them a property? – Pablo Jun 19 '18 at 22:59
  • I am looking for a similar solution. I have to create the task so it is still visible in the UI, but at the same time I want to mark it as skipped. – alltej Jul 15 '20 at 18:20

Have you considered using a decorator/higher-order function around your callable?

I'm thinking of using something like the following:

def conf_task_id_skip(python_callable):
    def skip_if_configured(*args, **context):
        # The Airflow context has no top-level "task_id" key; read it from
        # the TaskInstance ("ti") instead.
        task_id = context["ti"].task_id
        dag_run = context["dag_run"]
        # conf can be None on scheduled runs, so guard before .get().
        skip_task_ids = (dag_run.conf or {}).get("skip_task_ids", [])

        if skip_task_ids and task_id in skip_task_ids:
            return None
        else:
            return python_callable(*args, **context)

    return skip_if_configured

PythonOperator(
    task_id="task_id",
    python_callable=conf_task_id_skip(task_callable),
)

Then, if I want, I can manually pass the tasks I want to skip (and they will still be marked as successful).
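
For reference, the conf would be supplied when triggering the run, e.g. `airflow dags trigger my_dag --conf '{"skip_task_ids": ["some_task"]}'` in Airflow 2 (DAG and task ids hypothetical). And if you would rather have the task show as "skipped" in the UI instead of "success" (as asked in the comments on the other answer), a variant of the wrapper could raise AirflowSkipException instead of returning None; a minimal sketch, assuming the same context keys:

from airflow.exceptions import AirflowSkipException

def conf_task_id_skip(python_callable):
    def skip_if_configured(*args, **context):
        task_id = context["ti"].task_id
        skip_task_ids = (context["dag_run"].conf or {}).get("skip_task_ids", [])

        if task_id in skip_task_ids:
            # AirflowSkipException marks the task "skipped" rather than
            # letting it finish as "success".
            raise AirflowSkipException(f"{task_id} skipped via dag_run.conf")
        return python_callable(*args, **context)

    return skip_if_configured

Note that skipped state propagates: downstream tasks with the default all_success trigger rule will be skipped as well, which may or may not be what you want.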

If you wish, you can also make this more robust by adding a check for whether skipping is disallowed (e.g. in prod):

from airflow.models import Variable

def conf_task_id_skip(python_callable):
    def skip_if_configured(*args, **context):
        # Variable.get returns a string, so compare explicitly instead of
        # relying on truthiness ("false" would be truthy).
        if Variable.get("disallow_conf_task_id_skip", default_var="false").lower() == "true":
            return python_callable(*args, **context)

        task_id = context["ti"].task_id
        dag_run = context["dag_run"]
        skip_task_ids = (dag_run.conf or {}).get("skip_task_ids", [])

        if skip_task_ids and task_id in skip_task_ids:
            return None
        else:
            return python_callable(*args, **context)

    return skip_if_configured
Philippe Hebert