
I am trying to containerize my Airflow setup. I've been tasked with keeping the environment the same, just moving it into a Docker container. We currently have Airflow and all our dependencies installed in an Anaconda environment, so I've created a custom Docker image that installs Anaconda and builds that environment. The problem is that our current deployment starts Airflow through systemd services, while Docker needs to run it directly via the CLI ("airflow webserver/scheduler/worker"), and when I run it that way I get an error. The error appears after I start the scheduler.

Our DAGs require a custom repo that helps us communicate with our database servers. Within that repo we use pathlib to get the path of a config file and pass it to configparser.

Basically like this:

import configparser
from pathlib import Path

config = configparser.ConfigParser()
p = Path(__file__)   # path to this script
p = p.parent         # the directory containing it
config_file_name = 'comms.conf'
config.read(p.joinpath('config', config_file_name))  # passes a PosixPath, not a str

This throws the following error for all my DAGs in Airflow:

Broken DAG: [/opt/airflow/dags/example_folder/example_dag.py] 'PosixPath' object is not iterable

On the command line the error is:

[2021-01-11 19:53:13,868] {dagbag.py:259} ERROR - Failed to import: /opt/airflow/dags/example_folder/example_dag.py
Traceback (most recent call last):
  File "/opt/anaconda3/envs/airflow/lib/python3.7/site-packages/airflow/models/dagbag.py", line 256, in process_file
    m = imp.load_source(mod_name, filepath)
  File "/opt/anaconda3/envs/airflow/lib/python3.7/imp.py", line 172, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 696, in _load
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/opt/airflow/example_folder/example_dag.py", line 8, in <module>
    dag = Dag()
  File "/opt/airflow/dags/util/dag_base.py", line 27, in __init__
    self.comms = get_comms(Variable.get('environment'))
  File "/opt/airflow/repository/repo_folder/custom_script.py", line 56, in get_comms
    config = get_config('comms.conf')
  File "/opt/airflow/repository/repo_folder/custom_script.py", line 39, in get_config
    config.read(p.joinpath('config', config_file_name))
  File "/opt/anaconda3/envs/airflow/lib/python3.7/site-packages/backports/configparser/__init__.py", line 702, in read
    for filename in filenames:
TypeError: 'PosixPath' object is not iterable

I was able to replicate this behavior outside of the Docker container, so I don't think Docker has anything to do with it. It has to be a difference between how Airflow runs as a systemd service and how it runs via the CLI?
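
For reference, the failure reduces to a few lines with no Airflow involved. A minimal sketch, assuming (from the backports/configparser path in the traceback) that the old PyPI configparser backport is shadowing the standard-library module; the config path is the one from the traceback:

import configparser
from pathlib import Path

# If this prints a site-packages/backports path instead of the standard
# library, the PyPI backport has shadowed the built-in module.
print(configparser.__file__)

config = configparser.ConfigParser()
# The stdlib configparser (Python 3.6.1+) accepts a single path-like object;
# the old 3.5.x backport tries to iterate it and raises
# TypeError: 'PosixPath' object is not iterable
config.read(Path('/opt/airflow/repository/repo_folder/config/comms.conf'))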

Here is my airflow service file that works:

[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service

[Service]
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=/opt/anaconda3/envs/airflow/bin/airflow webserver
Restart=on-failure
RestartSec=5s
PrivateTmp=true

[Install]
WantedBy=multi-user.target

Here is the Airflow environment file referenced by the service file. Note that I needed to export these same environment variables in my shell to get Airflow to run this far from the CLI. Also note that the custom repos live in the /opt/airflow directory.

AIRFLOW_CONFIG=/opt/airflow/airflow.cfg
AIRFLOW_HOME=/opt/airflow
PATH=/bin:/opt/anaconda3/envs/airflow/bin:/opt/airflow/etl:/opt/airflow:$PATH
PYTHONPATH=/opt/airflow/etl:/opt/airflow:$PYTHONPATH

My airflow config is default, other than the following changes:

executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@192.168.x.x:5432/airflow
load_examples = False
logging_level = WARN
broker_url = amqp://guest:guest@127.0.0.1:5672/
result_backend = db+postgresql://airflow:airflow@192.168.x.x:5432/airflow
catchup_by_default = False

The version of configparser installed in the environment is:

configparser==3.5.3

My conda environment is using Python 3.7 and the Airflow version is 1.10.14. It's running on a CentOS 7 server. If anyone has any ideas that could help, I would appreciate it!

Edit: If I change the line config.read(p.joinpath('config', config_file_name)) to point directly to the config file, like config.read('/opt/airflow/repository/repo_folder/config/comms.conf'), it works fine. So it has something to do with how configparser handles the pathlib output? But it doesn't have this problem when Airflow is run via the systemd service?

Edit2: I can also wrap the pathlib object in str() and it works: config.read(str(p.joinpath('config', config_file_name))). I just want to know why this works fine with the systemd service... I'm afraid other stuff is going to be broken.
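
One way to pin down the systemd-vs-CLI difference is to log what each launch path actually imports; a small diagnostic sketch:

import sys
import configparser

# Run this under both the systemd service and the CLI shell and compare.
# A site-packages entry early in sys.path on the CLI side would explain
# why only the CLI run picks up the old configparser backport.
print(configparser.__file__)
print('\n'.join(sys.path))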

wymangr
  • it's suspicious that it's using `backports.configparser` in python3.7 (which has a native `configparser` standard library module) -- I suspect something is incorrectly setting `PYTHONPATH` or mutating `sys.path` to put `site-packages` ahead of the standard library – anthony sottile Jan 14 '21 at 07:33
  • Maybe it is a problem with environment variables? I created the Anaconda environment from a yaml file. `conda env create -f airflow.yml python=3.7` but for some reason, when I activate that environment and run `airflow initdb` it gives me a bunch of ModuleNotFoundErrors, even though I can start up python from that environment and import everything it says it's missing. To make it work, I added the site-packages directory within the environment to PYTHONPATH. Do you know of a reason why airflow wouldn't be able to see the packages installed to the same environment it is? – wymangr Jan 14 '21 at 21:20
  • yeah you should never do that, and that's precisely why it's broken. my guess is you're missing some airflow configuration to make it use your environment? is airflow installed into that environment? – anthony sottile Jan 14 '21 at 21:25
  • Yeah, airflow is installed into the same environment. It and all its dependencies are installed from a yml file. I can only run the airflow commands after activating that environment. I don't understand why it works with no problem as a systemd service, but not from the CLI. – wymangr Jan 14 '21 at 21:45
  • how are you setting your `PYTHONPATH` on the cli? what does `which airflow` give you? do you have the exact same setup as the unit? I'd recommend finding what's different and then eliminating the differences – anthony sottile Jan 14 '21 at 23:13

2 Answers


The path to the config file is computed incorrectly.

This is because of the following lines:

# filename: custom_script.py
p = Path(__file__).parent
confpath = p.joinpath('config', config_file_name)

Here confpath evaluates to /opt/airflow/repository/repo_folder/config/comms.conf.

But the path you shared shows that the configuration file lies at /opt/airflow/repository/repo_folder/conn.conf.

You need to resolve the config file relative to repo_folder by constructing its path using the folder custom_script.py is in.

# filename: custom_script.py

from pathlib import Path

# repo_folder is the directory this script lives in
p = Path(__file__).parent
confpath = p.joinpath(config_file_name)
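
Note that joinpath still returns a PosixPath, so with the old configparser backport on the import path you would still need to wrap confpath in str() before passing it to config.read (see Edit2 in the question).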
Oluwafemi Sule
  • Yeah, you are right. I forgot to add a piece into my example. `p = Path(__file__) p = p.parent` I'll update the post. – wymangr Jan 14 '21 at 18:42
  • The path to the configuration file is still evaluated to the wrong value. You can see my updated response once more. – Oluwafemi Sule Jan 15 '21 at 09:06

I was able to fix this issue by uninstalling configparser and installing a newer version:

configparser==5.0.1
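
For reference, assuming pip inside the activated conda environment, that would be `pip uninstall configparser` followed by `pip install configparser==5.0.1`; the newer backport accepts a single path-like object in read(), matching the standard-library behavior.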

wymangr