3 Answers


I use something like this.

  • A project is normally something completely separate or unique. Perhaps DAGs to process files that we receive from a certain client which will be completely unrelated to everything else (almost certainly a separate database schema)
  • I have my operators, hooks, and some helper scripts (delete all Airflow data for a certain DAG, etc.) in a common folder
  • I used to have a single git repository for the entire Airflow folder, but now I have a separate repository per project (it is more organized and easier to grant permissions on GitLab since the projects are so unrelated). This means that each project folder also has its own .git and .gitignore, etc.
  • I tend to save the raw data and then 'rest' a modified copy of the data which is exactly what gets copied into the database. I have to heavily modify some of the raw data due to different formats from different clients (Excel, web scraping, HTML email scraping, flat files, queries from SalesForce or other database sources...)

Example tree:

├───dags
│   ├───common
│   │   ├───hooks
│   │   │       pysftp_hook.py
│   │   │
│   │   ├───operators
│   │   │       docker_sftp.py
│   │   │       postgres_templated_operator.py
│   │   │
│   │   └───scripts
│   │           delete.py
│   │
│   ├───project_1
│   │   │   dag_1.py
│   │   │   dag_2.py
│   │   │
│   │   └───sql
│   │           dim.sql
│   │           fact.sql
│   │           select.sql
│   │           update.sql
│   │           view.sql
│   │
│   └───project_2
│       │   dag_1.py
│       │   dag_2.py
│       │
│       └───sql
│               dim.sql
│               fact.sql
│               select.sql
│               update.sql
│               view.sql
│
└───data
    ├───project_1
    │   ├───modified
    │   │       file_20180101.csv
    │   │       file_20180102.csv
    │   │
    │   └───raw
    │           file_20180101.csv
    │           file_20180102.csv
    │
    └───project_2
        ├───modified
        │       file_20180101.csv
        │       file_20180102.csv
        │
        └───raw
                file_20180101.csv
                file_20180102.csv
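For illustration, a DAG file inside project_1 might look roughly like the sketch below. The class imported from the common operator module and its parameters are assumptions (the tree above only shows file names), and the paths would need to match your own Airflow home.

# dags/project_1/dag_1.py - minimal sketch only; class and parameter names are assumed
from datetime import datetime

from airflow import DAG

# hypothetical class name inside common/operators/postgres_templated_operator.py
from common.operators.postgres_templated_operator import PostgresTemplatedOperator

with DAG(
    dag_id="project_1_dag_1",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    # lets templated SQL be referenced by file name from the project's sql folder
    template_searchpath="/path/to/airflow/dags/project_1/sql",
) as dag:
    load_dim = PostgresTemplatedOperator(
        task_id="load_dim",
        sql="dim.sql",  # rendered from dags/project_1/sql/dim.sql
    )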

Update, October 2021: I have a single repository for all projects now. All of my transformation scripts are in the plugins folder (which also contains hooks and operators - basically any code which I import into my DAGs). I try to keep the DAG code pretty bare, so it basically just dictates the schedule and where data is loaded to and from.

├───dags
│   │
│   ├───project_1
│   │     dag_1.py
│   │     dag_2.py
│   │
│   └───project_2
│         dag_1.py
│         dag_2.py
│
├───plugins
│   ├───hooks
│   │      pysftp_hook.py
│   │      servicenow_hook.py
│   │
│   ├───sensors
│   │      ftp_sensor.py
│   │      sql_sensor.py
│   │
│   ├───operators
│   │      servicenow_to_azure_blob_operator.py
│   │      postgres_templated_operator.py
│   │
│   └───scripts
│       ├───project_1
│       │      transform_cases.py
│       │      common.py
│       ├───project_2
│       │      transform_surveys.py
│       │      common.py
│       └───common
│              helper.py
│              dataset_writer.py
│
│   .airflowignore
│   Dockerfile
│   docker-stack-airflow.yml
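A DAG in this layout stays thin and imports its logic from plugins. A rough sketch of what one of these bare DAG files might look like is below; transform_cases() is a hypothetical function name (the tree only shows module file names), and the plugins folder is assumed to be on Airflow's path, which it is by default.

# dags/project_1/dag_1.py - sketch of the "bare" DAG style described above
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# hypothetical function inside plugins/scripts/project_1/transform_cases.py
from scripts.project_1.transform_cases import transform_cases

with DAG(
    dag_id="project_1_cases",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # the DAG only wires up scheduling; the actual transformation lives in plugins/scripts
    transform = PythonOperator(
        task_id="transform_cases",
        python_callable=transform_cases,
    )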
trench
  • I find this directory structure to be very useful; just one doubt: would **performance of `scheduler`** be impacted by placing everything under `dag` directory? Although even if I place files containing `operator`s (or anything except actual `DAG`s) outside `dag` directory and import them in `DAG` files, it would mean more-or-less the same thing. However non-`python` files (like `.sql` files above) inside `dag` folder can (in theory) cause unnecessary overhead for `scheduler` – y2k-shubham Jul 24 '18 at 08:55
  • Oh, I am not sure. I will have to look into it though. Honestly never crossed my mind - I set it up this way more due to version control (easy to clone or share a specific project end to end and not mix up projects) – trench Jul 24 '18 at 15:10
  • `I used to have a single git repository for the entire Airflow folder, but now I have a separate git per project` how are you then gluing the different repos together in the main repo? (my current solution is using git submodules) – louis_guitton Mar 27 '19 at 14:39
  • Just following up on this thread. Has anyone seen any impact on the scheduler from this repository structure? – Nadim Younes Jun 17 '20 at 14:17
  • curious to see a sample of your code – Renz Carillo Jan 21 '22 at 14:35
  • I always get into trouble with my local env (docker) and deploying to prod. What's your root path? It always forces me to add "airflow"; if not, it's not recognized by the dags when the scheduler loads the files. – Santiago P Apr 25 '22 at 08:59
  • In my opinion this structure is good, but when your project gets bigger the scripts folder should become a project added to the pythonpath (or installed as a python package) to follow a clean architecture and scale better – Fran Arenas Aug 31 '22 at 07:50
  • That would make sense. Do you just edit the Dockerfile to add something new to the pythonpath? I also have a new folder in plugins called 'tasks' which has TaskFlow API functions that I import these days, instead of traditional operators. – trench Sep 05 '22 at 19:41
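For the packaging approach suggested in the comments above, a minimal setup.py sketch (assuming the transformation code is moved into its own folder with this file at its root; the distribution name and layout are illustrative only):

from setuptools import find_packages, setup

# minimal packaging sketch so the transformation code can be pip-installed
# into the Airflow image instead of living under plugins/scripts
setup(
    name="airflow-scripts",
    version="0.1.0",
    packages=find_packages(),  # assumes each script folder contains an __init__.py
)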

I would love to benchmark folder structures with other people as well. It will probably depend on what you are using Airflow for, but I will share my case. I am building data pipelines for a data warehouse, so at a high level I basically have two steps:

  1. Dump a lot of data into a data lake (directly accessible only to a few people)
  2. Load data from the data lake into an analytics database, where the data will be modeled and exposed to dashboard applications (many SQL queries to model the data)

Today I organize the files into three main folders that try to reflect the logic above:

├── dags
│   ├── dag_1.py
│   └── dag_2.py
├── data-lake
│   ├── data-source-1
│   └── data-source-2
└── dw
    ├── cubes
    │   ├── cube_1.sql
    │   └── cube_2.sql
    ├── dims
    │   ├── dim_1.sql
    │   └── dim_2.sql
    └── facts
        ├── fact_1.sql
        └── fact_2.sql

This is more or less my basic folder structure.
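A DAG for one data source would then wire the two steps above together, roughly like the sketch below. The extract function, the "dwh" connection and the use of PostgresOperator are assumptions for illustration; the answer does not name a specific warehouse or operator.

# dags/dag_1.py - rough sketch of the dump-then-load pattern described above
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator


def extract_source_1(**context):
    # placeholder: pull data-source-1 and land it in data-lake/data-source-1
    ...


with DAG(
    dag_id="dag_1",
    start_date=datetime(2017, 6, 1),
    schedule_interval="@daily",
    catchup=False,
    template_searchpath="/path/to/airflow/dw",  # so the SQL below is found by file name
) as dag:
    dump_to_lake = PythonOperator(
        task_id="dump_to_data_lake",
        python_callable=extract_source_1,
    )

    load_dim_1 = PostgresOperator(
        task_id="load_dim_1",
        postgres_conn_id="dwh",
        sql="dims/dim_1.sql",
    )

    dump_to_lake >> load_dim_1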

fernandosjp
  • Just restating that this is just my way to organize. If anyone comes around to this question, it would be great to benchmark with other ways to structure folders and files! :) – fernandosjp Jun 22 '17 at 19:00

I am using Google Cloud Composer. I have to manage multiple projects with some additional SQL scripts, and I want to sync everything via gsutil rsync. Hence I use the following structure:

├───dags
│   │
│   ├───project_1
│       │
│       ├───dag_bag.py
│       │
│       ├───.airflowignore
│       │
│       ├───dag_1
│       │      dag.py
│       │      script.sql
│
├───plugins
│   │
│   ├───hooks
│   │      hook_1.py
│   │   
│   ├───sensors
│   │      sensor_1.py
│   │
│   ├───operators
│   │      operator_1.py

And the file dag_bag.py contains these lines:

from airflow.models import DagBag

dag_bag = DagBag(dag_folder="/home/airflow/gcs/dags/project_1", include_examples=False)
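Depending on the setup, the DAG objects collected this way usually also have to be exposed at module level so the scheduler registers them; a common pattern for that, sketched here using the same dag_bag object as above:

# expose every DAG found by the DagBag in this module's globals
for dag_id, dag in dag_bag.dags.items():
    globals()[dag_id] = dag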
p13rr0m