
I have created an on_failure_callback function (referring to Airflow's default on_failure_callback) to handle task failures.

It works well when there is only one task in a DAG. However, when there are two or more tasks, a task randomly fails because its operator is null; it can be resumed manually later. In airflow-scheduler.out the log is:

[2018-05-08 14:24:21,237] {models.py:1595} ERROR - Executor reports task instance %s finished (%s) although the task says its %s. Was the task killed externally? NoneType
[2018-05-08 14:24:21,238] {jobs.py:1435} ERROR - Cannot load the dag bag to handle failure for . Setting task to FAILED without callbacks or retries. Do you have enough resources?

The DAG code is:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
import airflow
from devops.util import WechatUtil
from devops.util import JiraUtil

def on_failure_callback(context):
    ti = context['task_instance']
    log_url = ti.log_url
    owner = ti.task.owner
    ti_str = str(context['task_instance'])
    wechat_msg = "%s - Owner:%s" % (ti_str, owner)
    WechatUtil.notify_bigdata_team(wechat_msg)

    jira_desc = "Please check log from url %s"%(log_url)
    JiraUtil.create_incident("DW",ti_str,jira_desc,owner)


args = {
    'queue': 'default',
    'start_date': airflow.utils.dates.days_ago(1),
    'retry_delay': timedelta(minutes=1),
    'on_failure_callback': on_failure_callback,
    'owner': 'user1',
    }
dag = DAG(dag_id='test_dependence1',default_args=args,schedule_interval='10 16 * * *')

load_crm_goods = BashOperator(
    task_id='crm_goods_job',
    bash_command='date',
    dag=dag)

load_crm_memeber = BashOperator(
    task_id='crm_member_job',
    bash_command='date',
    dag=dag)

load_crm_order = BashOperator(
    task_id='crm_order_job',
    bash_command='date',
    dag=dag)

load_crm_eur_invt = BashOperator(
    task_id='crm_eur_invt_job',
    bash_command='date',
    dag=dag)

crm_member_cohort_analysis = BashOperator(
    task_id='crm_member_cohort_analysis_job',
    bash_command='date',
    dag=dag)

crm_member_cohort_analysis.set_upstream(load_crm_goods)
crm_member_cohort_analysis.set_upstream(load_crm_memeber)
crm_member_cohort_analysis.set_upstream(load_crm_order)
crm_member_cohort_analysis.set_upstream(load_crm_eur_invt)

crm_member_kpi_daily = BashOperator(
    task_id='crm_member_kpi_daily_job',
    bash_command='date',
    dag=dag)

crm_member_kpi_daily.set_upstream(crm_member_cohort_analysis)
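
Before digging into the scheduler error, it can help to confirm that the callback body itself is sound by exercising the same message-building logic outside Airflow. A minimal sketch (FakeTask, FakeTI, and build_failure_message are hypothetical test helpers, not Airflow APIs):

```python
# Hypothetical stand-ins for an Airflow TaskInstance, so the callback's
# message-building logic can be exercised without a running scheduler.
class FakeTask:
    owner = "user1"

class FakeTI:
    log_url = "http://example.com/log"
    task = FakeTask()

    def __str__(self):
        return "<TaskInstance: test_dependence1.crm_goods_job>"

def build_failure_message(context):
    # Same logic as on_failure_callback, minus the notification calls.
    ti = context["task_instance"]
    return "%s - Owner:%s" % (str(ti), ti.task.owner)

msg = build_failure_message({"task_instance": FakeTI()})
print(msg)  # <TaskInstance: test_dependence1.crm_goods_job> - Owner:user1
```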

I have tried updating airflow.cfg by increasing the default memory from 512 to 4096, but no luck. Would anyone have any advice?

I also tried updating my JiraUtil and WechatUtil as follows, encountering the same error.

WechatUtil:

import requests

class WechatUtil:
    @staticmethod
    def notify_trendy_user(user_ldap_id, message):
        return None

    @staticmethod
    def notify_bigdata_team(message):
        return None

JiraUtil:

import json
import requests
class JiraUtil:
    @staticmethod
    def execute_jql(jql):
        return None

    @staticmethod
    def create_incident(projectKey, summary, desc, assignee=None):
        return None
Lin Forest

1 Answer


(I'm shooting tracer bullets a bit here, so bear with me if this answer doesn't get it right on the first try.)

The null-operator issue with multiple task instances is weird... it would help with troubleshooting if you could boil the current code down to an MCVE, e.g., 1–2 operators, excluding the JiraUtil and WechatUtil parts if they're not related to the callback failure.

Here are 2 ideas:

1. Can you try changing the line that fetches the task instance out of the context to see if this makes a difference?

Before:

def on_failure_callback(context):
    ti = context['task_instance']
    ...

After:

def on_failure_callback(context):
    ti = context['ti']
    ...

I saw this usage in the Airflow repo (https://github.com/apache/incubator-airflow/blob/c1d583f91a0b4185f760a64acbeae86739479cdb/airflow/contrib/hooks/qubole_check_hook.py#L88). It's possible it can be accessed both ways.
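
In Airflow's template context, both keys point at the same TaskInstance object, which a plain dict can mimic for illustration:

```python
class DummyTI:  # stand-in for the real TaskInstance
    pass

ti = DummyTI()
# Airflow's template context exposes the task instance under both
# 'task_instance' and the shorter 'ti' key.
context = {"task_instance": ti, "ti": ti}

assert context["ti"] is context["task_instance"]
```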

2. Can you try adding provide_context=True on the operators either as a kwarg or in default_args?
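
For example, a sketch of this idea applied to the DAG's shared default_args (whether BashOperator actually honors provide_context in your Airflow version is an assumption to verify):

```python
from datetime import timedelta

# Sketch: adding provide_context to default_args passes it to every
# operator in the DAG as a kwarg.
default_args = {
    'queue': 'default',
    'retry_delay': timedelta(minutes=1),
    'owner': 'user1',
    'provide_context': True,  # idea 2: make the context available
}
```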

Taylor D. Edmiston