
A similar question was asked previously (how-to-programmatically-set-up-airflow-1-10-logging-with-localstack-s3-endpoint), but it was never solved.

I have Airflow running in a Docker container set up with docker-compose; I followed this guide. Now I want to download some data from an S3 bucket, but I need to set up the credentials to allow that. Everywhere this only seems to be done through the UI by manually setting AWS_ACCESS_KEY_ID & AWS_SECRET_ACCESS_KEY, which exposes these values in the UI. I want to set this up in the code itself by reading in the environment variables. In boto3 this would be done with:

import boto3
session = boto3.Session(
    aws_access_key_id=settings.AWS_SERVER_PUBLIC_KEY,
    aws_secret_access_key=settings.AWS_SERVER_SECRET_KEY,
)

So how would I do this in the code for the DAGs?

Code:

import traceback
import airflow
from airflow import DAG
from airflow.exceptions import AirflowFailException
from airflow.operators.python import PythonOperator

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def _download_s3_data(templates_dict, **context):
    # templates_dict contains a list of the values returned by upstream tasks
    data = templates_dict.get("sagemaker_autopilot_data")
    if any(not path for path in data):
        raise AirflowFailException("Some of the paths were not passed!")
    else:
        (
            sagemaker_training,
            sagemaker_testing,
        ) = data
        s3hook = S3Hook()
        # parse the s3 url into bucket name and key
        bucket_name, key = s3hook.parse_s3_url(s3url=sagemaker_training)
        try:
            # this call needs AWS credentials
            file_name = s3hook.download_file(key=key, bucket_name=bucket_name)
        except Exception:
            traceback.print_exc()
            raise AirflowFailException("Error downloading s3 file")

ENV file:

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

Edit:

Amazon Web Services Connection seems to be the only documentation about it, but it's somewhat confusing and doesn't mention how to do this programmatically.


2 Answers


The S3Hook takes aws_conn_id as a parameter. You simply need to define the connection once for your Airflow installation, and then you will be able to use that connection in your hook.

The default name of the connection is aws_default (see https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/connections/aws.html#default-connection-ids). Simply create the connection first (or edit it if it is already there), either via the Airflow UI, via an environment variable, or via a Secrets Backend.

Here is the documentation describing all the options you can use:

https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html

As described in https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/connections/aws.html, the connection's login is used as AWS_ACCESS_KEY_ID and its password is used as AWS_SECRET_ACCESS_KEY. The AWS connection form in the Airflow UI is customized and shows hints and options via custom fields, so you can easily start with the UI.

Once you have the connection defined, the S3Hook will read the credentials stored in the connection it uses (so by default: aws_default). You can also define multiple AWS connections with different IDs and pass those connection ids as the aws_conn_id parameter when you create the hook.
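
For example, here is a minimal sketch of the environment-variable route. The connection id my_aws, the bucket and the key are hypothetical placeholders, not something from your setup:

# The variable name AIRFLOW_CONN_MY_AWS maps to the connection id "my_aws".
# Set it in the environment of every Airflow container, e.g.:
#   AIRFLOW_CONN_MY_AWS='aws://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@/?region_name=us-east-1'
# (URL-encode the secret if it contains special characters)

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# The hook reads the credentials from the connection, so nothing sensitive
# lives in the DAG code itself
s3hook = S3Hook(aws_conn_id="my_aws")
file_name = s3hook.download_file(key="path/to/object.csv", bucket_name="my-bucket")

If you do not pass aws_conn_id, the hook falls back to aws_default.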

Jarek Potiuk

Just to add on to @Jarek Potiuk's answer, this is what I ended up doing.

1. Create an .env file with the following variables

AIRFLOW_UID and AIRFLOW_GID are obtained by running the following bash command: echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env. The other AWS variables are specific to the temporary credentials I am using.

You can exclude REGION_NAME if you want; my credentials are based on this specific region.

AIRFLOW_UID=
AIRFLOW_GID=
REGION_NAME=us-east-1
AWS_SESSION_TOKEN=
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

2. Add env variables to docker-compose.yaml

I used this template, which you can get directly by running curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.1.2/docker-compose.yaml' in your root dir.

version: "3"
x-airflow-common: &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.2}
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ""
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: "true"
    AIRFLOW__CORE__LOAD_EXAMPLES: "true"
    AIRFLOW__API__AUTH_BACKEND: "airflow.api.auth.backend.basic_auth"
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    # Add env variables here!
    AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
    AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    AWS_SESSION_TOKEN: ${AWS_SESSION_TOKEN}
    REGION_NAME: ${REGION_NAME}

3. Then the first part of the DAG is to run a PythonOperator that sets up the connection using the Generating a connection URI guide.

From here just link all the other tasks that you intend to run (see the DAG wiring sketch after the code below).

import os
import json

from airflow import settings
from airflow.models.connection import Connection
from airflow.exceptions import AirflowFailException

def _create_connection(**context):
    """
    Sets the connection information about the environment using the Connection
    class instead of doing it manually in the Airflow UI
    """
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
    AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")
    REGION_NAME = os.getenv("REGION_NAME")
    credentials = [
        AWS_SESSION_TOKEN,
        AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY,
        REGION_NAME,
    ]
    if any(not credential for credential in credentials):
        raise AirflowFailException("Environment variables were not passed")

    extras = json.dumps(
        dict(
            aws_session_token=AWS_SESSION_TOKEN,
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=REGION_NAME,
        ),
    )
    try:
        conn = Connection(
            conn_id="s3_con",
            conn_type="S3",
            extra=extras,
        )
        # persist the connection in the metadata DB so other tasks can use it
        session = settings.Session()
        session.add(conn)
        session.commit()
    except Exception as e:
        raise AirflowFailException(
            f"Error creating connection to Airflow :{e}",
        )
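
For completeness, here is a minimal sketch of how this could be wired into a DAG. The DAG id, the Airflow Variables holding the two S3 URLs, and the reuse of _download_s3_data from the question are assumptions for illustration, not part of the original answer:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="sagemaker_autopilot",  # placeholder DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # First task: register the "s3_con" connection from the env variables
    create_connection = PythonOperator(
        task_id="create_connection",
        python_callable=_create_connection,
    )
    # Downstream task: anything needing S3 can now use
    # S3Hook(aws_conn_id="s3_con"), e.g. _download_s3_data from the question
    download_s3_data = PythonOperator(
        task_id="download_s3_data",
        python_callable=_download_s3_data,
        templates_dict={
            # hypothetical Airflow Variables holding the two S3 URLs
            "sagemaker_autopilot_data": [
                "{{ var.value.sagemaker_training_s3_url }}",
                "{{ var.value.sagemaker_testing_s3_url }}",
            ],
        },
    )

    create_connection >> download_s3_data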