
I am quite new to Apache Airflow. I use PyCharm as my IDE. I create a project (Anaconda environment) and a Python script that includes DAG definitions and Bash operators. When I open my Airflow webserver, my DAGs are not shown; only the default example DAGs are. My AIRFLOW_HOME variable contains ~/airflow, so I stored my Python script there and now it shows up.

How do I use this in a project environment?

Do I change the environment variable at the start of every project?

Is there a way to add specific airflow home directories for each project?

I don't want to store my DAGs in the default Airflow directory, since I want to add them to my git repository. Kindly help me out.


2 Answers


You can set/override Airflow options specified in ${AIRFLOW_HOME}/airflow.cfg with environment variables of the form AIRFLOW__{SECTION}__{KEY} (note the double underscores). Here is a link to the airflow docs. So you can simply do

export AIRFLOW__CORE__DAGS_FOLDER=/path/to/dags/folder
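
If you want to check that the override is actually picked up, one option (a sketch, assuming Airflow 1.10.x, where airflow.configuration exposes the parsed config as conf) is:

# Print the effective dags_folder as airflow resolves it
python -c "from airflow.configuration import conf; print(conf.get('core', 'dags_folder'))"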

However, doing this for different projects is tedious and error prone. As an alternative, you can consider using pipenv to manage virtual environments instead of Anaconda. Here is a nice guide about pipenv and the problems it solves. One of pipenv's default features is that it automatically loads variables defined in a .env file when you spawn a shell with the virtualenv activated. So here is what your workflow with pipenv could look like:

cd /path/to/my_project

# Creates venv with python 3.7 
pipenv install --python=3.7 Flask==1.0.3 apache-airflow==1.10.3

# Set home for airflow in a root of your project (specified in .env file)
echo "AIRFLOW_HOME=${PWD}/airflow" >> .env

# Enters created venv and loads content of .env file 
pipenv shell

# Initialize airflow
airflow initdb
mkdir -p ${AIRFLOW_HOME}/dags/
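
To sanity-check the setup from inside the same pipenv shell, you can start the two core services (a minimal sketch, assuming the default port 8080 is free):

# Run each command in its own terminal, or background the first one
airflow webserver -p 8080
airflow scheduler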

Note: I will explain the use of Flask==1.0.3 at the end; in short, it is there because pipenv checks whether sub-dependencies are compatible in order to ensure reproducibility.

After these steps, you would get the following project structure:

my_project
├── airflow
│   ├── airflow.cfg
│   ├── airflow.db
│   ├── dags
│   ├── logs
│   │   └── scheduler
│   │       ├── 2019-07-07
│   │       └── latest -> /path/to/my_project/airflow/logs/scheduler/2019-07-07
│   └── unittests.cfg
├── .env
├── Pipfile
└── Pipfile.lock

Now, when you initialize Airflow for the first time, it will create the ${AIRFLOW_HOME}/airflow.cfg file and use/expand ${AIRFLOW_HOME}/dags as the value for dags_folder. If you still need a different location for dags_folder, you can use the .env file again:

echo "AIRFLOW__CORE__DAGS_FOLDER=/different/path/to/dags/folder" >> .env

Thus, your .env file will look like:

AIRFLOW_HOME=/path/to/my_project/airflow
AIRFLOW__CORE__DAGS_FOLDER=/different/path/to/dags/folder

What we have accomplished, and why this works just fine

  1. Since you installed airflow in a virtual environment, you need to activate that environment in order to use airflow.
  2. Since you created it with pipenv, you need pipenv shell in order to activate the venv.
  3. Since you use pipenv shell, the variables defined in .env are always exported into your venv. On top of that, pipenv shell spawns a subshell; therefore, when you exit it, all the additional environment variables are cleared as well (see the sketch after this list).
  4. Different projects that use airflow get different locations for their log files etc.
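
For example, here is how the scoping from points 2 and 3 plays out (a minimal sketch, assuming the .env file created above):

cd /path/to/my_project

# Spawns a subshell and loads the content of .env
pipenv shell
echo ${AIRFLOW_HOME}    # -> /path/to/my_project/airflow

# Leave the subshell; the variable is cleared with it
exit
echo ${AIRFLOW_HOME}    # -> empty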

Additional notes on pipenv

  1. In order to use the venv created with pipenv as your IDE's project interpreter, use the path printed by pipenv --py (shown below).
  2. By default, pipenv creates all venvs in the same global location, like conda does, but you can change that behavior to creating a .venv in the project's root by adding export PIPENV_VENV_IN_PROJECT=1 to your .bashrc (or other rc file). PyCharm will then pick it up automatically when you go into the project interpreter settings.
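
For example (a sketch covering both notes; the rc file location is an assumption, adjust for your shell):

# 1. Path to the venv's python interpreter, for PyCharm's project interpreter settings
pipenv --py

# 2. Create .venv in the project root instead of the global location (add to ~/.bashrc or other rc)
export PIPENV_VENV_IN_PROJECT=1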

Note on usage of Flask==1.0.3

Airflow 1.10.3 from PyPI depends on flask>=1.0,<2.0 and on jinja2>=2.7.3,<=2.10.0. Today, when I tested these code snippets, the latest available flask was 1.1.0, which depends on jinja2>=2.10.1. This means that although pipenv can install all the required software, it fails to lock the dependencies. So for clean use of my code samples, I had to pin a version of flask that requires a version of jinja2 compatible with airflow's requirements. But there is nothing to worry about: the latest version of airflow on GitHub has already fixed that.

  • Is there a significant difference between using pipenv and virtualenv? Can I use either one, or does pipenv have a significant advantage for airflow? – Command Aug 06 '19 at 19:32
  • `pipenv` builds on top of `pip` and `virtualenv` and uses them under the hood. In your case, the advantage comes mainly from the built-in ability to use `*.env` files, where you can define `AIRFLOW_HOME` for a particular project, since doing it in `.bashrc` (or similar) would still be global. – Ilya Kisil Aug 06 '19 at 19:55
  • Thank you. How do I install the latest version of airflow, which solves the issue? Is there an equivalent to `pip install --upgrade apache-airflow`, or should I download it from GitHub manually? – Command Aug 06 '19 at 20:17
  • It looks like `airflow` released 1.10.4 on PyPI several hours ago. I just tried `pipenv install apache-airflow` and everything worked fine. In general, I would expect `pipenv` to work similarly to `pip` in terms of passing options, but you'd better look into their documentation. – Ilya Kisil Aug 06 '19 at 20:42
  • I did everything you said and created a DAG file named `test_dag.py` in the `dags` directory. I am in my pipenv shell and I start the webserver using `airflow webserver -p 8080`. It works, but I don't see my `test_dag` there. I see a whole bunch of other dags such as `example_bash_operator`. – Command Aug 06 '19 at 20:47
  • Have you started `airflow scheduler`? That helps for me when I create a new DAG but don't see it in the web UI. – Ilya Kisil Aug 06 '19 at 20:56
  • I started it. It still doesn't recognize my DAG. The weird thing is that the example dags shown in the webserver are not in my virtual environment's dags folder. – Command Aug 06 '19 at 21:56

Edit the airflow.cfg file and set:

load_examples = False
dags_folder = /path/to/your/dag/files

If your airflow home directory is not set to the default, you should set the AIRFLOW_HOME environment variable. If it's annoying to change it every time, just set it in your PyCharm run configuration or in your local OS shell profile (~/.bashrc).
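
For example (a sketch; the path is a placeholder for your own project location):

# In ~/.bashrc (or your shell's rc file), so it applies to every new shell
export AIRFLOW_HOME=/path/to/your/project/airflow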

My suggestion is to write a tiny script to run airflow in local mode; you should start the airflow webserver, scheduler, and workers individually.
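
A minimal sketch of such a script (start_airflow.sh is a hypothetical name; the worker is only needed with CeleryExecutor):

#!/usr/bin/env bash
# Start each airflow service in the background, logging to separate files
airflow webserver -p 8080 > webserver.log 2>&1 &
airflow scheduler > scheduler.log 2>&1 &
# airflow worker > worker.log 2>&1 &    # only needed with CeleryExecutor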

For me, it's much more convenient to run the airflow services on a small development machine, so consider doing that.
