I'm trying to use Apache Airflow
with packaged DAGs (https://airflow.apache.org/docs/stable/concepts.html#packaged-dags).
I've written my code as a Python package, and naturally it depends on other popular libraries such as numpy, scipy, etc.
EDIT:
This is the setup.py of my custom Python package:
from setuptools import setup, find_packages
from pathlib import Path
from typing import List
import distutils.text_file


def parse_requirements(filename: str) -> List[str]:
    """Return requirements from requirements file."""
    # Ref: https://stackoverflow.com/a/42033122/
    return distutils.text_file.TextFile(
        filename=str(Path(__file__).with_name(filename))
    ).readlines()


setup(
    name='classify_business',
    version='0.1',
    python_requires=">=3.6",
    description='desc',
    url='https://urlgitlab/datascience/classifybusiness',
    author='Marco fumagalli',
    author_email='marco.fumagalli@mycompany.com',
    packages=find_packages(),
    license='MIT',
    install_requires=parse_requirements('requirements.txt'),
    zip_safe=False,
    include_package_data=True,
)
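Before involving Airflow at all, a quick sanity check that the package installs and imports in a scratch virtualenv (the paths and the .git suffix on the URL are placeholders, and I'm assuming the top-level import name is classify_business):

python3 -m venv /tmp/scratch_venv
source /tmp/scratch_venv/bin/activate
python3 -m pip install git+https://urlgitlab/datascience/classifybusiness.git
python3 -c "import classify_business"   # prints nothing if the install worked
deactivate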
requirements.txt contains the packages my code needs (vertica_python, pandas, numpy, etc.), pinned to the versions my code requires.
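Roughly, it's equivalent to something like this (the pins below are placeholders, not my exact versions):

cat > requirements.txt <<'EOF'
vertica_python==0.10.4
pandas==1.0.3
numpy==1.18.2
scipy==1.4.1
EOF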
I wrote a little shell script based on the one provided in the docs:
set -eu -o pipefail

if [ $# -ne 5 ]; then
    echo "First param should be /srv/user_name/virtualenvs/name_virtual_env"
    echo "Second param should be the name of the temp directory"
    echo "Third param should be the git url"
    echo "Fourth param should be the dag zip name, i.e. dag_zip.zip, to be copied into AIRFLOW__CORE__DAGS_FOLDER"
    echo "Fifth param should be the package name, i.e. classify_business"
    exit 1
fi

venv_path=${1}
dir_tmp=${2}
git_url=${3}
dag_zip=${4}
pkg_name=${5}

# create and activate the virtualenv
python3 -m venv "$venv_path"
source "$venv_path/bin/activate"

# install the package and its dependencies from gitlab into the temp directory
mkdir "$dir_tmp"
cd "$dir_tmp"
python3 -m pip install --prefix="$PWD" "git+$git_url"

# zip everything and copy the archive into the dags folder
zip -r "$dag_zip" *
cp "$dag_zip" "$AIRFLOW__CORE__DAGS_FOLDER"
cd .. && rm -r "$dir_tmp"
The script installs my package, along with its dependencies, directly from GitLab, zips everything up, and copies the zip into the dags folder.
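For example, I invoke it like this (the script name and paths are placeholders):

./package_dags.sh \
    /srv/user_name/virtualenvs/classify_env \
    tmp_build_dir \
    https://urlgitlab/datascience/classifybusiness.git \
    dag_zip.zip \
    classify_business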
This is the content of the folder tmp_dir before it gets zipped:
bin
lib
lib64
predict_dag.py
train_dag.py
Airflow doesn't seem to be able to import packages installed in lib or lib64. I'm getting this error:
ModuleNotFoundError: No module named 'vertica_python'
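My understanding is that Airflow adds only the root of the zip to sys.path, while --prefix nests the dependencies under lib/pythonX.Y/site-packages, so they never end up importable. A quick check from inside tmp_dir shows where pip actually put them (the python3.6 part depends on the venv):

find . -maxdepth 4 -type d -name site-packages
# e.g. ./lib/python3.6/site-packages -- nested, not at the zip root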
I even tried to move my custom package outside of lib:
bin
my_custom_package
lib
lib64
predict_dag.py
train_dag.py
But I'm still getting the same error.
PS: I think part of the problem lies in how to get pip to install a package into a specific location.
The Airflow example uses --install-option="--install-lib=/path/", but that option is unsupported:
Location-changing options found in --install-option: ['--install-lib'] from command line. This configuration may cause unexpected behavior and is unsupported. pip 20.2 will remove support for this functionality. A possible replacement is using pip-level options like --user, --prefix, --root, and --target. You can find discussion regarding this at https://github.com/pypa/pip/issues/7309.
Using --prefix leads to a structure like the one above, with the module-not-found error.
Using --target leads to every package being installed flat in the specified directory, as in the sketch below.
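Concretely, I swapped the --prefix line in the script for something like:

# install everything flat into the temp dir so packages sit at the zip root
python3 -m pip install --target="$PWD" "git+$git_url"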
In this case I get a pandas-related error:
C extension: No module named 'pandas._libs.tslibs.conversion' not built
I guess this is related to dynamic libraries that need to be available at the system level?
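If so, that would explain it: zipimport can only load pure-Python modules, so compiled extension modules (.so files) such as pandas' C extensions can't be imported from inside a zip at all. Listing the archive would show whether any are present:

unzip -l dag_zip.zip | grep '\.so$' | head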
Any hint?
Thanks