
I'm trying to use Apache Airflow with packaged DAGs (https://airflow.apache.org/docs/stable/concepts.html#packaged-dags).

I've written my code as a Python package, and my code obviously depends on other popular libraries such as numpy, scipy, etc.

EDIT: This is the setup.py of my custom Python package:

from setuptools import setup, find_packages
from pathlib import Path
from typing import List

import distutils.text_file

def parse_requirements(filename: str) -> List[str]:
    """Return requirements from requirements file."""
    # Ref: https://stackoverflow.com/a/42033122/
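    # TextFile strips comments and blank lines by default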
    return distutils.text_file.TextFile(filename=str(Path(__file__).with_name(filename))).readlines()


setup(name='classify_business',
      version='0.1',
      python_requires=">=3.6",
      description='desc',
      url='https://urlgitlab/datascience/classifybusiness',
      author='Marco fumagalli',
      author_email='marco.fumagalli@mycompany.com',
      packages=find_packages(),
      license='MIT',
      install_requires=parse_requirements('requirements.txt'),
      zip_safe=False,
      include_package_data=True)

requirements.txt lists the packages (vertica_python, pandas, numpy, etc.) together with the versions my code needs.
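For reference, it looks something like this (the versions here are illustrative, not the exact ones I pin):

vertica_python==0.10.4
pandas==1.0.3
numpy==1.18.4
scipy==1.4.1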

I wrote a little shell script based on the one provided in the docs:

#!/usr/bin/env bash
set -eu -o pipefail

if [ "$#" -ne 5 ]; then
    echo "First param should be /srv/user_name/virtualenvs/name_virtual_env"
    echo "Second param should be the name of the temp directory"
    echo "Third param should be the git url"
    echo "Fourth param should be the dag zip name, i.e. dag_zip.zip, to be copied into AIRFLOW__CORE__DAGS_FOLDER"
    echo "Fifth param should be the package name, i.e. classify_business"
    exit 1
fi


venv_path="${1}"
dir_tmp="${2}"
git_url="${3}"
dag_zip="${4}"
pkg_name="${5}"

# create and activate a clean virtualenv, then install the package into a temp dir
python3 -m venv "$venv_path"
source "$venv_path/bin/activate"
mkdir "$dir_tmp"
cd "$dir_tmp"

python3 -m pip install --prefix="$PWD" "git+${git_url}"

# zip the contents and copy the archive into the DAGs folder
zip -r "$dag_zip" ./*
cp "$dag_zip" "$AIRFLOW__CORE__DAGS_FOLDER"

# leave the temp dir before deleting it (rm would fail from inside it)
cd ..
rm -r "$dir_tmp"

The script installs my package along with its dependencies directly from GitLab, zips everything up, and then moves the zip into the DAGs folder.
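An example invocation (the script name package_dags.sh is my placeholder; the arguments are the ones from the usage message and from the setup.py above):

./package_dags.sh /srv/user_name/virtualenvs/name_virtual_env tmp_dir \
    https://urlgitlab/datascience/classifybusiness dag_zip.zip classify_business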

This is the content of the temp directory before it is zipped:

bin  
lib  
lib64  
predict_dag.py  
train_dag.py

Airflow doesn't seem to be able to import the packages installed in lib or lib64. I'm getting this error:

ModuleNotFoundError: No module named 'vertica_python'

I even tried to move my custom package outside of lib:

bin
my_custom_package
lib  
lib64  
predict_dag.py  
train_dag.py

But I'm still getting the same error.

PS: I think part of the problem lies in how to get pip to install a package into a specific location. The Airflow example uses --install-option="--install-lib=/path/", but that option is deprecated:

Location-changing options found in --install-option: ['--install-lib'] from command line. This configuration may cause unexpected behavior and is unsupported. pip 20.2 will remove support for this functionality. A possible replacement is using pip-level options like --user, --prefix, --root, and --target. You can find discussion regarding this at https://github.com/pypa/pip/issues/7309.

Using --prefix leads to a structure like the one above, with the ModuleNotFoundError.
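As far as I can tell, that's because --prefix nests everything under a version-specific site-packages directory, so inside the zip the packages end up at paths like these (Python version illustrative):

lib/python3.6/site-packages/vertica_python/
lib/python3.6/site-packages/pandas/
lib64/python3.6/site-packages/numpy/

Airflow only puts the zip root on sys.path, so nothing under those site-packages directories can be imported, hence the ModuleNotFoundError.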

Using --target installs every package directly into the specified directory. In this case I get a pandas-related error:

C extension: No module named 'pandas._libs.tslibs.conversion' not built

I guess it's related to dynamic libraries that should be available at the system level?
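One quick way to test that theory (a sketch): list the compiled extension modules inside the directory pip installed into; zipimport cannot load .so files from a zip archive, so any module that shows up here would be affected:

find . -name '*.so'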

Any hint?

Thanks

  • Is the missing `vertica_python` package listed as an `install_requires` entry in your custom DAG package's setup.py file? Or is that installed separately? – bartaelterman May 05 '20 at 08:40
  • All packages (vertica_python, pandas, numpy, etc.) are installed along with my custom package via setup.py. See the edited question. – Marco Fumagalli May 05 '20 at 08:52

2 Answers


The Airflow documentation page you're referring to says this about packaged DAGs:

To allow this you can create a zip file that contains the DAG(s) in the root of the zip file and have the extra modules unpacked in directories.

The way I interpret this is different from yours. I don't think Airflow handles these packaged DAGs as a real Python package. It just seems like a custom zip folder that is added to your DAGs folder. So the lib and lib64 folders you have are probably not real Python modules (they don't have an __init__.py file). That's why they say that "the extra modules should be unpacked in directories".

Look at the example zip file they give:

my_dag1.py
my_dag2.py
package1/__init__.py
package1/functions.py

package1 has an __init__.py file. So in your case, your vertica_python library should be directly importable, like this:

my_custom_package
vertica_python/
predict_dag.py  
train_dag.py

However, I don't think you should do this. I have the impression that the modules you should add here are your own developed modules, not third-party libraries.

So I suggest that you install the libraries you need to run your packaged DAGs beforehand.
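For example, something along these lines in the environment Airflow itself runs in, on the scheduler and on every worker (a sketch, reusing the requirements.txt from your package):

python3 -m pip install -r requirements.txt

Then the packaged DAG only needs to contain your own, pure-Python modules.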

bartaelterman

Ciao Marco,

I know this is an old question, but I had to go through the very same process and what worked for me was to use:

pip install -r ../requirements_dag.txt --target="$PWD"

The same works for packages hosted on git. The key difference is the use of --target rather than --prefix.
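For a package hosted on git, such as the one in the question, a command along these lines should work (URL taken from the setup.py above):

pip install git+https://urlgitlab/datascience/classifybusiness --target="$PWD"

Because --target flattens all the packages into the specified directory, they end up at the root of the zip, where Airflow can import them.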

AlessioG
  • Hi Alessio, thanks. But I think packaged DAGs are a poor solution because of this: "packaged dags cannot contain dynamic libraries (eg. libz.so) these need to be available on the system if a module needs those. In other words only pure python modules can be packaged." I ended up using the BashOperator, with a little overhead in the code but with the advantage of being language agnostic. – Marco Fumagalli Aug 07 '20 at 09:48
  • True, I agree with you on that. My main reason for using packaged DAGs is that we have DAGs sourced from different projects/teams. We used the KubernetesExecutor to deal with dependencies in a better way. – AlessioG Aug 07 '20 at 12:23