
I am wrestling with ingesting Apache Airflow logs into Elasticsearch, using Logstash filters to parse the log lines. One thing I am struggling to get my head around is how to appropriately handle cases where log lines are nested, e.g. when a workflow logs from within a task. For instance, one log line might look like this:

[2020-01-28 20:23:21,341] {{base_task_runner.py:115}} INFO - Job 389: Subtask delete_consumptiondata [2020-01-28 20:23:21,341] {{cli.py:545}} INFO - Running <TaskInstance: azureconsumption_usage-1.1.delete_consumptiondata 2020-01-27T00:00:00+00:00 [running]> on host devaf1-dk1.sys.dom

Does anyone have thoughts on what might be an appropriate way to handle this - or, even better, experience handling nested log lines such as this?

1 Answer


You can use the configuration below to parse the common format of the logs. If you want to add more specific parsers, use https://grokdebug.herokuapp.com/ to test them:

input {
  file {
    # Airflow's default log layout: /airflow/logs/<dag_id>/<task_id>/<execution_date>/<try_number>.log
    path => "/airflow/logs/*/*/*/*.log"
  }
}

filter {
  grok {
    # With break_on_match => false, both patterns are applied: the first
    # extracts dag_id, task_id, execution_date and try_number from the file
    # path; the second parses the log line itself and overwrites "message"
    # with everything after the log level.
    match => {
      "path" => "/airflow/logs/(?<dag_id>.*?)/(?<task_id>.*?)/(?<execution_date>.*?)/(?<try_number>.*?).log$"
      "message" => "%{TIMESTAMP_ISO8601:timestamp_matched}. ..%{USERNAME:file}\:%{NUMBER:line}.. %{WORD:log_level}[- ]{3}%{GREEDYDATA:message}"
    }
    break_on_match => false
    overwrite => [ "message" ]
  }

  mutate {
    # Same log_id format Airflow itself uses for Elasticsearch remote logging.
    add_field => {
      "log_id" => "%{[dag_id]}-%{[task_id]}-%{[execution_date]}-%{[try_number]}"
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
  }
}
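
Regarding the nested part of the question: after the grok above runs, the overwritten "message" field still contains the inner record (everything from "Job 389: Subtask ..." onward), and that inner record follows the same [timestamp] {{file:line}} LEVEL - message shape as the outer line. One option - this is only a sketch, and the inner_* field names are illustrative rather than anything Airflow-specific - is a second grok pass over that field, added inside the same filter block:

  grok {
    # Split out the nested log record, if any; the non-greedy DATA pattern
    # captures the "Job 389: Subtask delete_consumptiondata " prefix.
    match => {
      "message" => "%{DATA:subtask_prefix}\[%{TIMESTAMP_ISO8601:inner_timestamp}\] \{\{%{USERNAME:inner_file}:%{NUMBER:inner_line}\}\} %{WORD:inner_log_level} - %{GREEDYDATA:inner_message}"
    }
    tag_on_failure => []  # most lines have no nested record, so a miss is fine
  }

Lines without a nested record simply pass through untouched, while nested ones get the inner timestamp, source location, level and payload as separate fields you can query on.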