
My spark-submit application runs a query and returns a different exit code depending on the dataset state.

Is it possible to skip downstream tasks right after my spark-submit operator? I am thinking of the skip_exit_code feature of BashOperator, which is surprisingly missing in all other operators.

from airflow import DAG
from airflow.models import BaseOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# ArgList, CheckDataCfg and SPARK_JOBS_JAR are defined elsewhere in my project
def spark_job(task_id: str, cfg: ArgList, main_class: str, dag: DAG) -> BaseOperator:
    copy_args = cfg.to_arg_list()

    return SparkSubmitOperator(
        task_id=task_id,
        conn_id='spark_default',
        java_class=main_class,
        application=SPARK_JOBS_JAR,
        application_args=copy_args,
        total_executor_cores=2,
        executor_cores=1,
        executor_memory='1g',
        num_executors=1,
        name=task_id,
        verbose=False,
        driver_memory='1g',
        dag=dag
    )

cfg = CheckDataCfg(...)
check_data_task = spark_job('check-data', cfg, 'etljobs.spark.CheckDataRecieved', dag)

check_data_task >> '<my next task which I need to skip sometimes>'

UPDATE: The current SparkSubmitHook implementation throws an exception if the return code is not 0, so there are only two workarounds, which I found later:

  1. Create custom SparkSubmitHook and SparkSubmitOperator classes that ignore user-defined non-zero exit codes, so that either an AirflowSkipException is thrown or the return code is pushed to XCom for further use.
  2. Use BashOperator instead. It already supports the skip_exit_code feature; you would need to construct all the Spark CLI args manually in Python, which is not a big deal (see the sketch below).
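A minimal, untested sketch of option 2, assuming an Airflow 2.x release where BashOperator supports skip_exit_code (renamed skip_on_exit_code in later versions). ArgList and SPARK_JOBS_JAR are the same user-defined names as above; the master URL and the exit code 125 are placeholders for your own setup:

import shlex

from airflow import DAG
from airflow.operators.bash import BashOperator

def spark_job_bash(task_id: str, cfg: ArgList, main_class: str, dag: DAG) -> BashOperator:
    # Hand-built spark-submit CLI, mirroring the SparkSubmitOperator arguments above
    cmd = [
        'spark-submit',
        '--master', 'spark://spark-master:7077',  # placeholder: your master URL
        '--class', main_class,
        '--name', task_id,
        '--total-executor-cores', '2',
        '--executor-cores', '1',
        '--executor-memory', '1g',
        '--driver-memory', '1g',
        SPARK_JOBS_JAR,
        *cfg.to_arg_list(),
    ]
    return BashOperator(
        task_id=task_id,
        bash_command=shlex.join(cmd),  # shlex.join (Python 3.8+) quotes each argument
        # A task that exits with this code is marked skipped, so downstream tasks
        # with the default trigger rule are skipped as well.
        skip_exit_code=125,
        dag=dag,
    )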

1 Answer


The SparkSubmitHook has a _spark_exit_code attribute that can be used here. We can create a custom operator that inherits all SparkSubmitOperator functionality and, in addition, returns the _spark_exit_code value.

I didn't test it, but I think the following code should work for you:

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.operators.python import ShortCircuitOperator

def shortcircuit_fn(**context):
    exit_code = context['ti'].xcom_pull(task_ids='check-data')
    if exit_code == 125:  # place the expected code(s) here; note the value is an int
        return True
    return False


class MySparkSubmitOperator(SparkSubmitOperator):

    def execute(self, context):
        super().execute(context)
        # Returning a value from execute() pushes it to XCom when do_xcom_push=True
        return self._hook._spark_exit_code

with DAG(dag_id='spar',
         default_args=default_args,
         schedule_interval=None,
         ) as dag:
    spark_op = MySparkSubmitOperator(task_id='check-data', ..., do_xcom_push=True)
    short_op = ShortCircuitOperator(task_id='short_circuit', python_callable=shortcircuit_fn)
    next_op = AnyOperator()
    spark_op >> short_op >> next_op

This is how it works: MySparkSubmitOperator pushes the value of _spark_exit_code to XCom (with do_xcom_push=True, the value returned by execute() is stored under the return_value key). ShortCircuitOperator then verifies it against the expected codes: if the condition is met, the workflow continues; if not, all downstream tasks are marked as skipped.

  • Nice solution, but it seems the `_spark_exit_code` field is used only when the Spark app is running in Kubernetes cluster mode. In my case, I am not running it in Kubernetes :-( – Alexey Novakov Jul 04 '21 at 09:31
  • @AlexeyNovakov There is a special case for K8s, but it should work for non-K8s: https://github.com/apache/airflow/blob/866a601b76e219b3c043e1dbbc8fb22300866351/airflow/providers/apache/spark/hooks/spark_submit.py#L442:L452 – Elad Kalif Jul 04 '21 at 09:36
  • I mean that `_spark_exit_code` is initialized only in the Kubernetes case; otherwise its value is None: https://github.com/apache/airflow/blob/866a601b76e219b3c043e1dbbc8fb22300866351/airflow/providers/apache/spark/hooks/spark_submit.py#L503 – Alexey Novakov Jul 04 '21 at 11:41
  • I'm not sure why it's like that, but you can create a custom hook that saves the return code https://github.com/apache/airflow/blob/866a601b76e219b3c043e1dbbc8fb22300866351/airflow/providers/apache/spark/hooks/spark_submit.py#L438 and then utilize it. A PR to the open source to address it is also welcome :) – Elad Kalif Jul 04 '21 at 12:32
  • Exactly, that is what I have just done: my own custom hook with several lines added to handle skip_exit_code, similar to what BashOperator is doing (a sketch of this approach follows below). – Alexey Novakov Jul 04 '21 at 12:58
  • So yeah, there is no out-of-the-box solution. You have to code it with the options presented. – Elad Kalif Jul 04 '21 at 13:59
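A minimal, untested sketch of that custom-hook approach, assuming the Airflow 2.x provider layout. skip_exit_code is a hypothetical parameter here, mirroring BashOperator; it is not part of the stock SparkSubmitHook API, and _submit_sp is a private attribute of the hook that may change between versions:

from airflow.exceptions import AirflowException, AirflowSkipException
from airflow.providers.apache.spark.hooks.spark_submit import SparkSubmitHook

class SparkSubmitHookWithSkip(SparkSubmitHook):
    # Hypothetical subclass: translate one user-defined exit code into a skip,
    # similar to what BashOperator does with skip_exit_code.

    def __init__(self, *args, skip_exit_code=None, **kwargs):
        super().__init__(*args, **kwargs)
        self._skip_exit_code = skip_exit_code

    def submit(self, application='', **kwargs):
        try:
            super().submit(application, **kwargs)
        except AirflowException:
            # The stock hook raises AirflowException on any non-zero return code;
            # turn the user-defined code into a skip instead. _submit_sp is the
            # hook's private Popen handle (version-dependent).
            returncode = self._submit_sp.returncode if self._submit_sp else None
            if returncode is not None and returncode == self._skip_exit_code:
                raise AirflowSkipException(
                    f'spark-submit exited with {returncode}; skipping downstream tasks'
                )
            raise

A matching SparkSubmitOperator subclass would then construct this hook (passing the desired skip_exit_code) instead of the stock one in its execute().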