
I am playing with Amazon Managed Workflows for Apache Airflow (MWAA) for the first time, so I could be missing some basics.

If I have a Python application organised/spread across two files/scripts, so it looks like this:

my_script_1.py

import my_script_2

print('This is script 1')
my_script_2.print_me()

my_script_2.py

def print_me():
    print('I am script 2')

When this runs locally I get:

% python my_script_1.py
This is script 1
I am script 2

How would I deploy/organise this for MWAA if I want to run my_script_1.py and have it call/invoke my_script_2.py?

If I turn my_script_1.py into a DAG and upload the DAG to the DAG folder in my S3 bucket, how do I provide my_script_2.py to MWAA?

my_script_2.py:

  1. is not a Python library published on a repository like pypi.org
  2. is not a Python wheel file (.whl)
  3. is not hosted on a private PyPI/PEP-503 compliant repository accessible from your environment

I got that list from the Python dependencies overview in the MWAA documentation.

Is the solution to just upload my_script_2.py into the DAG folder on my S3 bucket, and expect MWAA to replicate my_script_2.py to a location/path which my_script_1.py will have access to at run time?

Does this page from airflow.apache.org, Typical structure of packages, apply to MWAA environments?


1 Answer


Python scripts called by other Python scripts can be copied to the DAG folder in the S3 bucket, and they will be accessible at run time. Here is a working example:

test_import.py

from datetime import datetime, timedelta
import my_script_1

# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG

# Operators; we need this to operate!
from airflow.operators.python_operator import PythonOperator


def invoke_my_python_scripts():
    print(f"This is printed from the dag script: {__file__}")
    my_script_1.print_me()


# These args will get passed on to each operator
# You can override them on a per-task basis during operator initialization
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email": ["airflow@example.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}
with DAG(
    "dag_test_import",
    default_args=default_args,
    description="A simple tutorial DAG",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
    tags=["example"],
) as dag:
    my_python_task = PythonOperator(
        task_id="python_task_id_1", dag=dag, python_callable=invoke_my_python_scripts,
    )

my_script_1.py

import my_script_2

def print_me():
    print("This is script 1")
    my_script_2.print_me()

my_script_2.py

def print_me():
    print("I am script 2")

I copied these to the DAG folder in my AWS S3 bucket:

% aws s3 cp test_import.py  s3://mwaa/dags/test_import.py
upload: ./test_import.py to s3://mwaa/dags/test_import.py

% aws s3 cp my_script_1.py  s3://mwaa/dags/my_script_1.py
upload: ./my_script_1.py to s3://mwaa/dags/my_script_1.py

% aws s3 cp my_script_2.py  s3://mwaa/dags/my_script_2.py
upload: ./my_script_2.py to s3://mwaa/dags/my_script_2.py
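
If you prefer to script the upload in Python rather than use the AWS CLI, the same copies can be done with boto3. This is just a sketch: the bucket name mwaa and the dags/ prefix are taken from the commands above, so substitute your own environment's bucket.

upload_dags.py (a hypothetical local helper, not something that goes in the DAG folder)

import boto3

s3 = boto3.client("s3")
for script in ["test_import.py", "my_script_1.py", "my_script_2.py"]:
    # same destination keys as the aws s3 cp commands above
    s3.upload_file(script, "mwaa", f"dags/{script}")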

The MWAA log shows the output from all 3 scripts, so it is working:

[2022-05-12, 10:52:58 UTC] {{logging_mixin.py:109}} INFO - This is printed from the dag script: /usr/local/airflow/dags/test_import.py
[2022-05-12, 10:52:58 UTC] {{logging_mixin.py:109}} INFO - This is script 1
[2022-05-12, 10:52:58 UTC] {{logging_mixin.py:109}} INFO - I am script 2
[2022-05-12, 10:52:58 UTC] {{python.py:152}} INFO - Done. Returned value was: None
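
This works because the contents of the dags/ prefix in the S3 bucket are replicated onto the Airflow workers (the log above shows the DAG file at /usr/local/airflow/dags/test_import.py), and Airflow puts the DAG folder on the Python import path, which is why the plain import my_script_2 resolves. To confirm what ended up on your own environment, a minimal debugging DAG along these lines will print the folder contents and whether it is on sys.path (the file name, DAG id and task id here are hypothetical):

show_dag_folder.py

import os
import sys
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def show_dag_folder():
    # Directory this DAG file was replicated to (/usr/local/airflow/dags in the log above)
    dag_folder = os.path.dirname(os.path.abspath(__file__))
    print(f"DAG folder: {dag_folder}")
    print(f"Files in DAG folder: {sorted(os.listdir(dag_folder))}")
    print(f"DAG folder on sys.path: {dag_folder in sys.path}")


with DAG(
    "dag_show_dag_folder",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # trigger manually from the UI
    catchup=False,
) as dag:
    PythonOperator(task_id="show_dag_folder", python_callable=show_dag_folder)

For the same reason, the Typical structure of packages layout from the Airflow docs should also apply: helper modules can live in a subdirectory of the DAG folder and be imported with a package-style path, although I have only tested the flat layout shown above.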