
I'm used to running pipelines via AWS Data Pipeline, but I'm getting familiar with Airflow (Cloud Composer).

In Data Pipeline we would:

  • Spawn a task runner,
  • Bootstrap it,
  • Do work,
  • Kill the task runner.

I just realized that my Airflow runners are not ephemeral. I touched a file in /tmp, did it again in a separate DagRun, then listed /tmp and found two files. I expected only the one I had most recently touched.
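
Roughly, the check looked like this (DAG/task ids and the marker file names are just illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

with DAG(
    dag_id="tmp_persistence_check",
    start_date=datetime(2020, 3, 1),
    schedule_interval=None,  # trigger manually, once per test
) as dag:
    # Write a file named after the current run into the worker's local /tmp.
    touch_tmp = BashOperator(
        task_id="touch_tmp",
        bash_command="touch /tmp/marker_{{ ts_nodash }}",
    )

    # List /tmp afterwards; on a persistent worker, markers from earlier
    # DagRuns may still be present. (In an environment with several workers,
    # the two tasks may not even land on the same pod.)
    list_tmp = BashOperator(
        task_id="list_tmp",
        bash_command="ls -l /tmp/marker_*",
    )

    touch_tmp >> list_tmp
```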

This seems to mean I need to watch how much "stuff" is being stored locally on the runner.

I know the environment's GCS bucket has its /data folder FUSE-mounted on the workers, so I'm defaulting to storing a lot of my working files there and moving files from there to final buckets elsewhere, but how do you approach this? What would be "best practice"?

Thanks for the advice.

JW2
  • As you mentioned, the GCS bucket's `/data` folder is mounted at `/home/airflow/gcs/data` on the worker nodes and bi-directionally [synced](https://cloud.google.com/composer/docs/concepts/cloud-storage#data_synchronization) via FUSE. Given that, you can adopt the [gcs_to_gcs](https://airflow.apache.org/docs/stable/_api/airflow/contrib/operators/gcs_to_gcs/index.html) operator in a dedicated Airflow task to copy the data from one bucket to the other (a minimal sketch follows this comment thread). Please explain more about your general aim if my point is not accurate here. – Nick_Kh Mar 20 '20 at 10:16
  • Does saving files in `/home/airflow/gcs/data` take up disk space on the worker? I ran a task that unzips a lot of files. In the first run I was not deleting the files when I was finished with them and I noticed the workers ultimately crashed because of memory usage. – JW2 Mar 22 '20 at 13:25
  • This is a mounted folder, so it can't consume worker disk space; see the [Capacity considerations](https://cloud.google.com/composer/docs/concepts/cloud-storage#capacity_considerations) section of the Cloud Composer docs for more details. – Nick_Kh Mar 23 '20 at 08:02
  • That aligns with my understanding of the documentation - looking back at my script, I was actually writing to `/tmp` incorrectly. My original question still stands, though: it seems like bad practice to keep all of the data you're processing in `/home/airflow/gcs/data`. Any time you start a new Composer instance, this data will be in a different bucket than the old instance's. So I'm guessing the "best practice" is to use the `/home/airflow/gcs/data` bucket for processing and move the final results to a more permanent bucket elsewhere. – JW2 Mar 24 '20 at 13:14
  • If you don't need to keep staging or historical data in this GCS bucket to re-run tasks in the future, then you can move the resulting data elsewhere for long-term storage. – Nick_Kh Mar 26 '20 at 08:01
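
A rough sketch of the gcs_to_gcs suggestion above, assuming Airflow 1.10.x (the contrib operator linked in the comment) and made-up bucket names: copy task output from the environment's bucket (the one mounted at `/home/airflow/gcs` on the workers) to a permanent bucket.

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_to_gcs import (
    GoogleCloudStorageToGoogleCloudStorageOperator,  # Airflow 1.10.x contrib path
)

with DAG(
    dag_id="publish_results",
    start_date=datetime(2020, 3, 1),
    schedule_interval=None,
) as dag:
    copy_to_permanent = GoogleCloudStorageToGoogleCloudStorageOperator(
        task_id="copy_to_permanent_bucket",
        source_bucket="my-composer-env-bucket",    # hypothetical: the environment's bucket
        source_object="data/output/*.csv",         # files written via /home/airflow/gcs/data/output
        destination_bucket="my-permanent-bucket",  # hypothetical: long-term bucket
        destination_object="results/",
        move_object=True,  # delete the staging copies once transferred
    )
```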

1 Answer


Cloud Composer currently uses CeleryExecutor, which configures persistent worker processes that handle the execution of task instances. As you have discovered, you can make changes to the filesystems of the Airflow workers (which are Kubernetes pods), and they will indeed persist until the pod is restarted/replaced.

Best-practice-wise, you should treat the local filesystem as ephemeral to the task instance's lifetime, but you shouldn't expect it to be cleaned up for you. If you have tasks that perform heavy I/O, do that work outside of /home/airflow/gcs, because that path is network-mounted (GCSFUSE); if there is final data you want to persist, write it to /home/airflow/gcs/data, which is synced to the environment's bucket.
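
For example, a task along these lines keeps the scratch work on local disk and only writes the final artifact through the mount; everything other than the /home/airflow/gcs/data path is illustrative, and this assumes Airflow 1.10.x as shipped with Composer at the time:

```python
import os
import shutil
import tempfile
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x path

# FUSE mount that syncs to the environment's bucket.
GCS_DATA_DIR = "/home/airflow/gcs/data"


def process():
    # Do the heavy I/O on the worker's local disk, not on the GCSFUSE mount.
    scratch = tempfile.mkdtemp(prefix="unzip_job_")
    try:
        # ... unzip / transform files under `scratch` here ...
        result_path = os.path.join(scratch, "result.csv")
        with open(result_path, "w") as f:
            f.write("placeholder\n")

        # Persist only the final artifact through the mounted bucket.
        out_dir = os.path.join(GCS_DATA_DIR, "output")
        os.makedirs(out_dir, exist_ok=True)
        shutil.copy(result_path, os.path.join(out_dir, "result.csv"))
    finally:
        # The worker filesystem is not cleaned up for you, so remove the scratch dir.
        shutil.rmtree(scratch, ignore_errors=True)


with DAG(
    dag_id="local_scratch_example",
    start_date=datetime(2020, 3, 1),
    schedule_interval=None,
) as dag:
    PythonOperator(task_id="process", python_callable=process)
```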

hexacyanide
  • Gotcha, this helps. I'm used to the task runners actually being ephemeral and not having to worry about cleanup. It sounds like you could easily fill up the disk on a node if you're not careful to clean up after yourself. – JW2 May 20 '20 at 15:15