I am evaluating Airflow for a use case where, for starters, we need to:
1. download data from S3 (gigabytes)
2. transform data using tools packaged as Docker images, working with plain files
3. upload data again to S3
Workers would be running in AWS, with autoscaling.
I have implemented #1 and #2 on my machine with something like this:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

# Step 1: pull the file from S3 onto the worker's local filesystem
def download_from_s3(key: str, bucket_name: str, local_path: str) -> None:
    hook = S3Hook('aws_default')
    hook.download_file(key=key, bucket_name=bucket_name, local_path=local_path,
                       preserve_file_name=True, use_autogenerated_subdir=False)

with DAG(...) as dag:
    download_from_s3_task = PythonOperator(
        task_id='download_from_s3',
        python_callable=download_from_s3,
        op_kwargs={
            'key': 'bigfile.dat',
            'bucket_name': 'mybucket',
            'local_path': '/data/',
        },
    )

    # Step 2: run the containerized tool against the downloaded file via a bind mount
    process_docker = DockerOperator(
        task_id="process_docker",
        image="mytooldockerimage:1.0.0",
        command=[
            "-f", "/data/bigfile.dat",
            "-o", "/data/out/results.dat",
        ],
        mounts=[Mount(target="/data", source="/data", type="bind")],
    )

    download_from_s3_task >> process_docker
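I expect the third step to mirror the download; a rough sketch (the result key and task names here are made up, and I'm assuming S3Hook.load_file is the right upload counterpart):

def upload_to_s3(filename: str, key: str, bucket_name: str) -> None:
    hook = S3Hook('aws_default')
    # load_file pushes a local file to S3; replace=True overwrites an existing key
    hook.load_file(filename=filename, key=key, bucket_name=bucket_name, replace=True)

# inside the same `with DAG(...)` block:
upload_to_s3_task = PythonOperator(
    task_id='upload_to_s3',
    python_callable=upload_to_s3,
    op_kwargs={
        'filename': '/data/out/results.dat',
        'key': 'results.dat',
        'bucket_name': 'mybucket',
    },
)

process_docker >> upload_to_s3_task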
But I fear that these operators produce separate tasks that could each be scheduled on a different worker machine, so shared filesystem access between tasks cannot be guaranteed in a production environment.
If that is the case, the cool Airflow operators above are not really usable. Do I need to rewrite the ETL so that all local file processing happens in a single task, or can a group of operators be forced to run as a single unit on one worker?
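To make the "single task" alternative concrete, this is roughly what I imagine (just a sketch: the function name, the results key, and driving the container from the docker Python SDK instead of DockerOperator are my own assumptions):

import docker

def etl_single_task(key: str, bucket_name: str, local_path: str) -> None:
    # Everything that touches the local filesystem happens inside one task,
    # so it is guaranteed to run on a single worker.
    hook = S3Hook('aws_default')
    hook.download_file(key=key, bucket_name=bucket_name, local_path=local_path,
                       preserve_file_name=True, use_autogenerated_subdir=False)

    client = docker.from_env()
    # containers.run blocks until the container exits
    client.containers.run(
        "mytooldockerimage:1.0.0",
        ["-f", "/data/bigfile.dat", "-o", "/data/out/results.dat"],
        mounts=[Mount(target="/data", source="/data", type="bind")],
    )

    hook.load_file(filename='/data/out/results.dat', key='results.dat',
                   bucket_name=bucket_name, replace=True)

# wired up as a single PythonOperator in the DAG, losing the nice per-step operators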