
We're running an Airflow cluster using the puckel/airflow Docker image with docker-compose. The scheduler container writes its logs to /usr/local/airflow/logs/scheduler.

The problem is that the log files are never rotated, so disk usage keeps growing until the disk is full. A DAG for cleaning up the log directory is available, but it runs on the worker nodes, so the log directory on the scheduler container never gets cleaned up.

I'm looking for a way to send the scheduler log to stdout or to an S3/GCS bucket, but I haven't been able to find one. Is there any way to output the scheduler log to stdout or to an S3/GCS bucket?

toshiya

3 Answers


I finally managed to get the scheduler's log output to stdout.

Here you can find how to use a custom logger in Airflow. The default logging config is available on GitHub.

What you have to do is:

(1) Create a custom logging config module at ${AIRFLOW_HOME}/config/log_config.py:


# Setting processor (scheduler, etc..) logs output to stdout
# Referring https://www.astronomer.io/guides/logging
# This file is created following https://airflow.apache.org/docs/apache-airflow/2.0.0/logging-monitoring/logging-tasks.html#advanced-configuration

from copy import deepcopy
from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG
import sys

LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
LOGGING_CONFIG["handlers"]["processor"] = {
    "class": "logging.StreamHandler",
    "formatter": "airflow",
    "stream": sys.stdout,
}
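
If you want to sanity-check the module before wiring it into Airflow, a minimal check (my sketch; it assumes ${AIRFLOW_HOME} is on PYTHONPATH so that config.log_config is importable, see step (3) below) could look like:

# Quick sanity check for the custom config, run inside the scheduler container.
# Assumes ${AIRFLOW_HOME} is on PYTHONPATH so that "config.log_config" is importable.
from config.log_config import LOGGING_CONFIG

# The processor handler should now be a plain StreamHandler writing to stdout.
print(LOGGING_CONFIG["handlers"]["processor"]["class"])  # -> logging.StreamHandler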

(2) Set the logging_config_class option to config.log_config.LOGGING_CONFIG in airflow.cfg:

logging_config_class = config.log_config.LOGGING_CONFIG
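
If you configure Airflow through environment variables rather than airflow.cfg, the same option can also be set with Airflow's AIRFLOW__SECTION__KEY convention (on Airflow 2.x the option lives in the [logging] section):

AIRFLOW__LOGGING__LOGGING_CONFIG_CLASS=config.log_config.LOGGING_CONFIG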

(3) [Optional] Add $AIRFLOW_HOME to the PYTHONPATH environment variable.

export PYTHONPATH="${PYTHONPATH}:${AIRFLOW_HOME}"
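
For the puckel image mentioned in the question, AIRFLOW_HOME is /usr/local/airflow, so a Dockerfile extension could set this directly (a sketch, assuming PYTHONPATH is otherwise unset in the base image):

ENV PYTHONPATH=/usr/local/airflow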
  • Actually, you can set the path of logging_config_class to anything, as long as Python is able to load the module.
  • Setting handlers.processor to airflow.utils.log.logging_mixin.RedirectStdHandler didn't work for me; it used too much memory.
toshiya
  • Many thanks for pointing this out. It helped me configure Airflow tasks to log to stdout using the k8s executor. The next comment includes my Helm values, in case anyone finds them helpful. – faja Oct 19 '21 at 12:29
  • 1
  • My Helm values:

        config:
          logging:
            logging_config_class: airflow_local_settings.LOGGING_CONFIG
        airflowLocalSettings: |-
          from copy import deepcopy
          from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG
          import sys
          LOGGING_CONFIG = deepcopy(DEFAULT_LOGGING_CONFIG)
          LOGGING_CONFIG['handlers']['stdout'] = {
              'class': 'logging.StreamHandler',
              'stream': sys.stdout,
              'formatter': 'airflow',
              'filters': ['mask_secrets'],
          }
          LOGGING_CONFIG['loggers']['airflow.task']['handlers'] = [
              'stdout',
              'task',
          ]

    – faja Oct 19 '21 at 12:38
  • @faja if that worked for you, could you explain it a bit more and turn it into an answer? It seems like a good alternative to what toshiya did. – Danielson Dec 23 '21 at 06:13

Setting remote_logging = True in airflow.cfg is the key. Please check the thread here for detailed steps.
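
For reference, this roughly corresponds to the following airflow.cfg options (in the [logging] section on Airflow 2.x; the connection id and bucket below are placeholders):

[logging]
remote_logging = True
remote_log_conn_id = my_remote_log_conn
remote_base_log_folder = s3://my-bucket/airflow/logs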

Sharif
  • 1
    `remote_logging=True` outputs *task* logs to S3/GCS. My problem is outputting *scheduler* logs to stdout or S3/GCS instead of to the local file. – toshiya Jan 11 '21 at 05:52

You can extend the image with the following environment variables, or set the equivalent options in airflow.cfg:

ENV AIRFLOW__LOGGING__REMOTE_LOGGING=True
ENV AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
ENV AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS

The gcp_conn_id connection should have the correct permissions to create and delete objects in GCS.
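
Since the question runs Airflow with docker-compose, the same variables could also be set in the compose file; a hypothetical snippet (the service name scheduler is an assumption):

# Hypothetical docker-compose excerpt; the service name is an assumption.
services:
  scheduler:
    environment:
      - AIRFLOW__LOGGING__REMOTE_LOGGING=True
      - AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=gcp_conn_id
      - AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=gs://bucket_name/AIRFLOW_LOGS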

CrookedCloud