
I want to use the Python library rapidjson in my Airflow DAG. My code repo is hosted on Git. Whenever I merge something into the master or test branch, the changes are automatically reflected in the Airflow UI.

My Airflow is hosted as a VM on AWS EC2. Under the EC2 instances, I see three different instances for: scheduler, webserver, workers.

I connected to these 3 individually via Session Manager. Once the terminal opened, I installed the library using

pip install python-rapidjson

I also verified the installation using pip list. Now, I import the library in my dag's code simply like this:

import rapidjson

However, when I open the Airflow UI, my DAG shows this error:

No module named 'rapidjson'

Are there additional steps that I am missing out on? Do I need to import it into my Airflow code base in any other way as well?

Within my Airflow git repository, I also have a "requirements.txt" file. I tried to include

python-rapidjson==1.5.5

there as well, but I do not know how to actually install it.

I tried this:

pip install requirements.txt

within the session manager's terminal as well. However, the terminal is not able to locate this file. In fact, when I do "ls", I don't see anything.

pwd
/var/snap/amazon-ssm-agent/6522
x89

1 Answer


Have you tried using the PythonVirtualenvOperator?

It will allow you to install the library at runtime so you don't need to make changes on the server just for one job.

To run a function called my_callable, simply use the following:

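from airflow.operators.python import PythonVirtualenvOperator

# my_callable must import rapidjson inside its own body, not at the top of the DAG file
my_task = PythonVirtualenvOperator(
    task_id="my_task",
    requirements=["python-rapidjson==1.5.5"],  # installed into a temporary virtualenv at run time
    python_callable=my_callable,
)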

I still recommend updating your server environment for core libraries, but this approach works well when only a small minority of jobs need a special library.
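A minimal sketch of what `my_callable` might look like, assuming the task only needs to serialize a small payload (the payload and return value are made up for illustration). The point is that `import rapidjson` sits inside the function body: the operator runs the function in a separate virtualenv, so a top-level `import rapidjson` in the DAG file would still fail wherever the package is not installed.

```python
def my_callable():
    # Imported here, not at module level: this body executes inside the
    # virtualenv that the operator creates, where python-rapidjson is installed.
    import rapidjson

    # Hypothetical payload, only for illustration.
    payload = {"source": "airflow", "ok": True}
    return rapidjson.dumps(payload)
```

With this layout the DAG file itself never imports `rapidjson` at parse time, so the scheduler and webserver can load it even if the package is missing from their own environments.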

scr
  • What's the correct way to import this operator? I tried `from airflow.providers.virtualenv.operators.python_virtualenv import PythonVirtualenvOperator`, but it gives me the error `No module named 'airflow.providers'` – x89 Jun 03 '23 at 17:15
  • And then how would I actually use the library in the code? If I import it normally at the beginning of the DAG file with `import rapidjson`, it will still crash the DAG. – x89 Jun 03 '23 at 17:17
  • I've added the package import to the code. – scr Jun 03 '23 at 17:27
  • I have realized that with this method, I will need to re-import ALL libraries or functions again into this python callable since it's a virtual environment and can't access other imports outside the function. Is there any way to avoid this? So that I don't have to reimport everything and I can still call other functions from the dag file inside this function? – x89 Jun 04 '23 at 14:04
  • Did you change `system_site_packages` to False? To my knowledge, you only need to import the libraries that you've installed in your VirtualEnv. – scr Jun 06 '23 at 04:47
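
On the re-import question in the comments: as far as I know, the callable is shipped to the virtualenv on its own, so it cannot see functions defined elsewhere in the DAG file. One possible workaround, sketched below with hypothetical names (`my_callable_with_helpers`, `clean`, `rows`), is to nest small helpers inside the callable and pass data in as arguments, for example through the operator's `op_kwargs`. The standard-library `json` is used here only as a stand-in so the sketch runs anywhere; `rapidjson` would be imported in the same spot.

```python
def my_callable_with_helpers(rows):
    # Everything the task needs is imported or defined inside the function,
    # because module-level names from the DAG file are not available here.
    import json  # stand-in; in the virtualenv this could be `import rapidjson`

    def clean(row):
        # Hypothetical helper: normalize keys by trimming and lowercasing them.
        return {key.strip().lower(): value for key, value in row.items()}

    return json.dumps([clean(row) for row in rows], sort_keys=True)
```

The input would then be supplied by the operator, e.g. `op_kwargs={"rows": [...]}`, rather than read from a variable in the DAG file.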