How can I retrieve the yarn_application_id from the SparkSubmitHook? I tried using a custom operator and the task_instance property, but I guess I missed something...

def task_failure_callback(context):
    task_instance = context.get('task_instance')  # Need to access yarn_application_id here
    operator = task_instance.operator
    application_id = operator.yarn_application_id
    return ...

default_args = {
    'start_date': ...,
    'on_failure_callback': task_failure_callback
}

with DAG(DAG_ID, default_args=default_args, catchup=CATCHUP, schedule_interval=SCHEDULE_INTERVAL) as dag:
    ...

So I tried adding it as a new key-value pair in the context dict, but without success...

class CustomSparkSubmitHook(SparkSubmitHook, LoggingMixin):
    def __init__(self, ...):
        super().__init__(...)

    def submit_with_context(self, context, application="", **kwargs):
        # Build spark submit cmd
        ...
        # Run cmd as subprocess
        ...
        # Process spark submit log
        ...
        # Check spark-submit return code. In Kubernetes mode, also check the value
        # of exit code in the log, as it may differ.
        ...

        # We want the Airflow job to wait until the Spark driver is finished
        if self._should_track_driver_status:
            if self._driver_id is None:
                raise AirflowException(
                    "No driver id is known: something went wrong when executing " +
                    "the spark submit command"
                )

            # We start with the SUBMITTED status as initial status
            self._driver_status = "SUBMITTED"

            # Trying to export yarn_application_id unsuccessfully
            context['yarn_application_id'] = self.yarn_application_id

            # Start tracking the driver status (blocking function)
            ...

    @property
    def yarn_application_id(self):
        return self._yarn_application_id
belgacea
  • Your question is quite unclear. Could you clarify where you need to access `yarn_application_id` and include any error traceback you're getting with your current approach? – PirateNinjas Nov 05 '19 at 14:11
  • @PirateNinjas I don't face any error. I just can't find a way to retrieve `yarn_application_id`. I edited my question. – belgacea Nov 07 '19 at 10:01
  • Did you ever find an answer to this? I'm interested as well. I use an operator that sets one of its attributes as a uuid, and I'm trying to grab that uuid in a future task. – ZaxR Jan 06 '22 at 20:23
  • @ZaxR I had to leave that out because of something else. I don't use Airflow anymore, but I guess you could try something like `context['task'].yarn_application_id` or `context['ti'].yarn_application_id`, from this answer: https://stackoverflow.com/a/51167438/6450431 – belgacea Jan 07 '22 at 02:53
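
For anyone landing here, the `context['task'].yarn_application_id` idea from the last comment might look roughly like the sketch below. It deliberately does not import Airflow: `FakeSparkHook`, `FakeSparkOperator`, `run_task` and the placeholder application id are hypothetical stand-ins, so the pattern can be run anywhere. The assumption is that the operator keeps a reference to its hook and exposes the id as a property, so the failure callback can read it from `context['task']`. Whether `context['task']` in a real failure callback is the same object that ran `execute` may depend on the Airflow version, so treat this as something to try, not a confirmed fix.

```python
# Stand-in classes only: none of these are Airflow's real classes.


class FakeSparkHook:
    """Hypothetical stand-in for a hook that records a YARN application id."""

    def __init__(self):
        self._yarn_application_id = None

    def submit(self, application=""):
        # A real hook would parse the id out of the spark-submit log.
        # Here a placeholder id is set, then a failure is simulated.
        self._yarn_application_id = "application_0000000000000_0001"
        raise RuntimeError("simulated spark-submit failure")

    @property
    def yarn_application_id(self):
        return self._yarn_application_id


class FakeSparkOperator:
    """Hypothetical stand-in for an operator that keeps its hook around."""

    def __init__(self, application=""):
        self._application = application
        self._hook = None

    def execute(self, context):
        self._hook = FakeSparkHook()
        self._hook.submit(self._application)

    @property
    def yarn_application_id(self):
        # None until execute() has created the hook and an id was recorded.
        return self._hook.yarn_application_id if self._hook else None


def task_failure_callback(context):
    # Read the id from the task object instead of from the context dict.
    task = context.get("task")
    return getattr(task, "yarn_application_id", None)


def run_task(task):
    """Tiny driver: build a context, execute, call the callback on failure."""
    context = {"task": task}
    try:
        task.execute(context)
    except Exception:
        return task_failure_callback(context)
    return None
```

Calling `run_task(FakeSparkOperator("app.py"))` returns the placeholder id, because the hook recorded it before the simulated failure. With a real operator, the attribute holding the hook and the point at which the id gets set would need to be checked against the installed Airflow version.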

0 Answers