
In an attempt to set up airflow logging to localstack s3 buckets, for local and kubernetes dev environments, I am following the airflow documentation for logging to s3. To give a little context, localstack is a local AWS cloud stack that runs AWS services, including s3, locally.

I added the following environment variables to my airflow containers, similar to this other stack overflow post, in an attempt to log to my local s3 buckets. This is what I added to docker-compose.yaml for all airflow containers:

       - AIRFLOW__CORE__REMOTE_LOGGING=True
       - AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://local-airflow-logs
       - AIRFLOW__CORE__REMOTE_LOG_CONN_ID=MyS3Conn
       - AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
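For reference, these environment variables map onto the following airflow.cfg settings (a sketch for Airflow 1.10; the environment variables take precedence over the file):

[core]
remote_logging = True
remote_base_log_folder = s3://local-airflow-logs
remote_log_conn_id = MyS3Conn
encrypt_s3_logs = False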

I've also added my localstack s3 creds to airflow.cfg

[MyS3Conn]
aws_access_key_id = foo
aws_secret_access_key = bar
aws_default_region = us-east-1
host = http://localstack:4572    # s3 port. not sure if this is right place for it 
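(As a possible programmatic alternative, assuming an Airflow version that builds connections from AIRFLOW_CONN_* URI environment variables and turns URI query parameters into the connection's Extra, something like the following line in docker-compose.yaml might work instead; the exact URI encoding here is my assumption, not something I have verified against localstack:)

       - AIRFLOW_CONN_MYS3CONN=s3://foo:bar@/?host=http%3A%2F%2Flocalstack%3A4572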

Additionally, I've installed apache-airflow[hooks] and apache-airflow[s3], though it's not clear from the documentation which one is actually needed.
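(A hedged side note: on newer Airflow releases the S3 hook lives in the Amazon provider, so the install would look roughly like one of these, depending on version:)

pip install 'apache-airflow[s3]'              # Airflow 1.10.x extra
pip install apache-airflow-providers-amazon   # Airflow 2.x provider package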

I've followed the steps in a previous stack overflow post in an attempt to verify whether the S3Hook can write to my localstack s3 instance:

from airflow.hooks import S3Hook
s3 = S3Hook(aws_conn_id='MyS3Conn')
s3.load_string('test','test',bucket_name='local-airflow-logs')

But I get botocore.exceptions.NoCredentialsError: Unable to locate credentials.

After adding the credentials in the airflow console under /admin/connection/edit (screenshot omitted), a new exception is returned: botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records. Other people have encountered this same issue, and it may have been related to networking.

Regardless, a programmatic setup is needed, not a manual one.

I was able to access the bucket using a standalone Python script (entering AWS credentials explicitly with boto), but it needs to work as part of airflow.
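For reference, a minimal sketch of the kind of standalone script that does work against localstack (assuming boto3 and the same dummy credentials as above):

import boto3

# Standalone client pointed directly at the localstack endpoint; the
# credentials are the same dummy values used in airflow.cfg above.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localstack:4572",
    aws_access_key_id="foo",
    aws_secret_access_key="bar",
    region_name="us-east-1",
)

s3.put_object(Bucket="local-airflow-logs", Key="test", Body=b"test")
print(s3.list_objects_v2(Bucket="local-airflow-logs").get("Contents"))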

Is there a proper way to set up host / port / credentials for S3Hook by adding MyS3Conn to airflow.cfg?

Based on the airflow s3 hooks source code, it seems a custom s3 URL may not yet be supported by airflow. However, based on the airflow aws_hook source code (the parent class), it seems it should be possible to set the endpoint_url, including the port, and that it should be read from airflow.cfg.

I am able to inspect and write to my s3 bucket in localstack using boto alone. Also, curl http://localstack:4572/local-mochi-airflow-logs returns the contents of the bucket from the airflow container. However, aws --endpoint-url=http://localhost:4572 s3 ls returns Could not connect to the endpoint URL: "http://localhost:4572/".

What other steps might be needed to log to localstack s3 buckets from airflow running in docker, with an automated setup? And is this even supported yet?

oasisPolo

2 Answers


I think you're supposed to use localhost, not localstack, for the endpoint, e.g. host = http://localhost:4572.

In Airflow 1.10 you can override the endpoint on a per-connection basis, but unfortunately it only supports one endpoint at a time, so you'd be changing it for all AWS hooks using that connection. To override it, edit the relevant connection and put this in the "Extra" field:

{"host": "http://localhost:4572"}

I believe this will fix it?
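Since the question asks for a programmatic setup, the same Extra could presumably also be set from the CLI instead of the UI; a rough sketch using Airflow 1.10-style flags (Airflow 2.x uses airflow connections add with dashed flag names):

airflow connections --add \
    --conn_id MyS3Conn \
    --conn_type s3 \
    --conn_login foo \
    --conn_password bar \
    --conn_extra '{"host": "http://localhost:4572"}'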

Diego
  • This helped me in solving the problem with another provider. – Alexander Bogushov May 28 '20 at 10:38
  • I confirm I got this working with Airflow v2.2.1. To be 100% clear, I left the "host" field in the form empty and put the above JSON in the "Extra" field. I also had to put a nonsense login and password to avoid a `NoCredentialsError` exception. – Greg Nov 23 '21 at 10:19

I managed to make this work by referring to this guide. Basically you need to create a connection using the Connection class and pass in the credentials you need; in my case I needed AWS_SESSION_TOKEN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and REGION_NAME to make this work. Use this function as the python_callable in a PythonOperator, which should be the first task of the DAG.

import os
import json

from airflow import settings
from airflow.models.connection import Connection
from airflow.exceptions import AirflowFailException

def _create_connection(**context):
    """
    Sets the connection information about the environment using the Connection
    class instead of doing it manually in the Airflow UI
    """
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
    AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")
    REGION_NAME = os.getenv("REGION_NAME")
    credentials = [
        AWS_SESSION_TOKEN,
        AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY,
        REGION_NAME,
    ]
    # Fail the task early if any required environment variable is missing.
    if any(not credential for credential in credentials):
        raise AirflowFailException("Environment variables were not passed")

    # Everything besides conn_id/conn_type goes into the connection's Extra as JSON.
    extras = json.dumps(
        dict(
            aws_session_token=AWS_SESSION_TOKEN,
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=REGION_NAME,
        ),
    )
    try:
        conn = Connection(
            conn_id="s3_con",
            conn_type="S3",
            extra=extras,
        )
        # Persist the connection in the metadata database so hooks can look it
        # up; merely instantiating Connection does not save it.
        session = settings.Session()
        session.add(conn)
        session.commit()
    except Exception as e:
        raise AirflowFailException(
            f"Error creating connection to Airflow :{e!r}",
        )
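A minimal usage sketch for wiring this in as the first task (assuming Airflow 2.x imports; the DAG id and schedule here are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="create_s3_connection_example",  # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Register the "s3_con" connection before any task that relies on it runs.
    create_connection = PythonOperator(
        task_id="create_connection",
        python_callable=_create_connection,
    )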
yudhiesh
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/low-quality-posts/29623455) – dm2 Aug 19 '21 at 16:30
  • 1
    @dm2 alright I updated the answer to be more descriptive. – yudhiesh Aug 20 '21 at 12:52