I am evaluating Airflow for a use case where, for starters, we need to:
1. download data from S3 (gigabytes)
2. transform data using tools packaged as Docker images, working with plain files
3. upload data again to S3
Workers would be running in AWS, with autoscaling.
I have implemented #1 and #2 on my machine with something like this:
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

# Step 1: pull the file from S3 onto the worker's local filesystem
def download_from_s3(key: str, bucket_name: str, local_path: str) -> None:
    hook = S3Hook('aws_default')
    hook.download_file(key=key, bucket_name=bucket_name, local_path=local_path,
                       preserve_file_name=True, use_autogenerated_subdir=False)

with DAG(...) as dag:
    download_from_s3_task = PythonOperator(
        task_id='download_from_s3',
        python_callable=download_from_s3,
        op_kwargs={
            'key': 'bigfile.dat',
            'bucket_name': 'mybucket',
            'local_path': '/data/',
        },
    )

    # Step 2: run the containerized tool against the downloaded file via a bind mount
    process_docker = DockerOperator(
        task_id="process_docker",
        image="mytooldockerimage:1.0.0",
        command=[
            "-f", "/data/bigfile.dat",
            "-o", "/data/out/results.dat",
        ],
        mounts=[Mount(target="/data", source="/data", type="bind")],
    )

    download_from_s3_task >> process_docker
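I expect the third step to mirror the download; a rough sketch (the result key and task names here are made up, and I'm assuming S3Hook.load_file is the right upload counterpart):

def upload_to_s3(filename: str, key: str, bucket_name: str) -> None:
    hook = S3Hook('aws_default')
    # load_file pushes a local file to S3; replace=True overwrites an existing key
    hook.load_file(filename=filename, key=key, bucket_name=bucket_name, replace=True)

# inside the same `with DAG(...)` block:
upload_to_s3_task = PythonOperator(
    task_id='upload_to_s3',
    python_callable=upload_to_s3,
    op_kwargs={
        'filename': '/data/out/results.dat',
        'key': 'results.dat',
        'bucket_name': 'mybucket',
    },
)

process_docker >> upload_to_s3_task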
But I fear that these operators produce separate tasks that could each be scheduled on a different worker machine, so shared filesystem access between tasks cannot be guaranteed in a production environment.
If that is the case, the cool Airflow operators above are not really usable. Do I need to rewrite the ETL so that all local file processing happens in a single task, or can a group of operators be forced to run as a single unit on one worker?
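To make the "single task" alternative concrete, this is roughly what I imagine (just a sketch: the function name, the results key, and driving the container from the docker Python SDK instead of DockerOperator are my own assumptions):

import docker

def etl_single_task(key: str, bucket_name: str, local_path: str) -> None:
    # Everything that touches the local filesystem happens inside one task,
    # so it is guaranteed to run on a single worker.
    hook = S3Hook('aws_default')
    hook.download_file(key=key, bucket_name=bucket_name, local_path=local_path,
                       preserve_file_name=True, use_autogenerated_subdir=False)

    client = docker.from_env()
    # containers.run blocks until the container exits
    client.containers.run(
        "mytooldockerimage:1.0.0",
        ["-f", "/data/bigfile.dat", "-o", "/data/out/results.dat"],
        mounts=[Mount(target="/data", source="/data", type="bind")],
    )

    hook.load_file(filename='/data/out/results.dat', key='results.dat',
                   bucket_name=bucket_name, replace=True)

# wired up as a single PythonOperator in the DAG, losing the nice per-step operators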