My spark-submit application runs a query and returns different exit codes depending on the state of the dataset.
Is it possible to skip downstream tasks right after my SparkSubmitOperator task? I am thinking of something like the skip_exit_code
feature of BashOperator, which is surprisingly missing in all other operators.
from airflow import DAG
from airflow.models import BaseOperator
# On Airflow 1.x the operator lives in airflow.contrib.operators.spark_submit_operator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def spark_job(task_id: str, cfg: ArgList, main_class: str, dag: DAG) -> BaseOperator:
    copy_args = cfg.to_arg_list()
    return SparkSubmitOperator(
        task_id=task_id,
        conn_id='spark_default',
        java_class=main_class,
        application=SPARK_JOBS_JAR,
        application_args=copy_args,
        total_executor_cores='2',
        executor_cores='1',
        executor_memory='1g',
        num_executors='1',
        name=task_id,
        verbose=False,
        driver_memory='1g',
        dag=dag
    )

cfg = CheckDataCfg(...)
check_data_task = spark_job('check-data', cfg, 'etljobs.spark.CheckDataRecieved', dag)
check_data_task >> '<my next task which I need to skip sometimes>'
UPDATE:
The current SparkSubmitHook implementation does throw an exception if the return code is not 0, so there are only two workarounds which I found later:

- Create custom SparkSubmitHook and SparkSubmitOperator classes that ignore a user-defined set of non-zero exit codes, so that either an AirflowSkipException is raised or the return code value is pushed to XCom for further use; a sketch follows this list.
- Use BashOperator instead. It already supports the skip_exit_code feature. You would need to construct all the spark-submit CLI arguments manually in Python, which is not a big deal; see the second sketch below.
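For the first workaround, here is a minimal sketch of the operator-level half, assuming the apache-spark provider import paths (on Airflow 1.x the classes live under airflow.contrib). The stock hook only surfaces a failed spark-submit as an AirflowException and does not expose the numeric return code, so this sketch skips on any failure; mapping specific exit codes to skip-vs-fail, or pushing the code to XCom, would additionally require overriding SparkSubmitHook internals.

from airflow.exceptions import AirflowException, AirflowSkipException
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


class SkippingSparkSubmitOperator(SparkSubmitOperator):
    """Hypothetical subclass: a failed spark-submit marks this task (and hence
    its downstream tasks with the default all_success trigger rule) as skipped
    instead of failed."""

    def execute(self, context):
        try:
            super().execute(context)
        except AirflowException as exc:
            # Assumption: any non-zero exit code means "nothing to process",
            # so skipping downstream work is the desired outcome.
            raise AirflowSkipException(
                f"spark-submit exited with a non-zero code, skipping: {exc}"
            ) from exc

The spark_job() helper above would then simply construct SkippingSparkSubmitOperator instead of SparkSubmitOperator.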
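For the second workaround, a sketch of spark_job() rebuilt on BashOperator, reusing the question's ArgList and SPARK_JOBS_JAR. The Spark master URL (taken here from a hypothetical Airflow Variable spark_master_url), the choice of exit code 99, and the naive space-joining of arguments are all assumptions; note also that newer Airflow releases rename the parameter to skip_on_exit_code.

from airflow.operators.bash import BashOperator


def spark_job_bash(task_id: str, cfg: ArgList, main_class: str, dag: DAG) -> BaseOperator:
    # Mirror the SparkSubmitOperator settings as plain spark-submit CLI flags.
    cmd = " ".join(
        [
            "spark-submit",
            "--master", "{{ var.value.spark_master_url }}",  # assumed Airflow Variable
            "--class", main_class,
            "--name", task_id,
            "--driver-memory", "1g",
            "--executor-memory", "1g",
            "--executor-cores", "1",
            "--num-executors", "1",
            "--total-executor-cores", "2",
            SPARK_JOBS_JAR,
        ]
        + cfg.to_arg_list()
    )
    return BashOperator(
        task_id=task_id,
        bash_command=cmd,
        skip_exit_code=99,  # a Spark job exiting with code 99 marks the task as SKIPPED
        dag=dag,
    )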