3

I'm having trouble getting a templated SQL file to work in Composer. I think the problem is related to the fact that I'm packaging the DAG as a zip file in order to include additional code.

I started with this (just showing relevant parts):

dag = DAG('my_dag',
      default_args=default_args,
      schedule_interval=schedule_interval)

task0 = BigQueryOperator(
     task_id='task0',
     use_legacy_sql=False,
     bql='sql/query_file.sql',
     bigquery_conn_id=bigquery_conn_id,
     dag=dag)

The file structure looks like this:

/dags/my_dag_file.py
/dags/sql/query_file.sql
/dags/my_pkg/
/dags/my_pkg/__init__.py
/dags/my_pkg/extra_module.py

I'm zipping it like this and copying it to the Composer dags folder:

zip -r my_zip_file.zip *.py my_pkg/ sql/

This worked locally but gave an error when deployed on Composer:

TemplateNotFound: sql/query_file.sql

I'm sure I included the SQL file in the zip. I also tried moving it to the root folder (no sql/ subdirectory) but I got the same result.

I read somewhere that you need to set template_searchpath when instantiating the DAG object. I've not been able to do this successfully. When I try a relative path (sql) I get more TemplateNotFound errors. When I try an absolute path as below, I get not a directory.

Here's what I tried:

dag = DAG('my_dag',
      default_args=default_args,
      schedule_interval=schedule_interval,
      template_searchpath = os.path.dirname(__file__) + "/sql"
)

task0 = BigQueryOperator(
     task_id='task0',
     use_legacy_sql=False,
     bql='query_file.sql',
     bigquery_conn_id=bigquery_conn_id,
     dag=dag)

I also tried making 'sql' part of the task's path instead of the template searchpath, and again I tried moving everything to the root level, and got the same 'not a directory' error.

As far as I can tell the problem has something to do with the fact that the files are contained within a zip. __file__ returns /home/airflow/gcs/dags/my_zip_file.zip/my_dag_file.py. But then os.listdir(os.path.dirname(__file__)) throws the same not a directory error. So maybe because we're executing inside a zip archive, we can't use folders and paths the same way. And maybe Jinja is getting tripped up on this...? Or else maybe there is something more to do when packaging up the zip file?

[2018-06-20 15:35:34,837] {base_task_runner.py:98} INFO - Subtask: Traceback (most recent call last):
[2018-06-20 15:35:34,838] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/bin/airflow", line 27, in <module>
[2018-06-20 15:35:34,840] {base_task_runner.py:98} INFO - Subtask:     args.func(args)
[2018-06-20 15:35:34,841] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/bin/cli.py", line 392, in run
[2018-06-20 15:35:34,841] {base_task_runner.py:98} INFO - Subtask:     pool=args.pool,
[2018-06-20 15:35:34,842] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
[2018-06-20 15:35:34,843] {base_task_runner.py:98} INFO - Subtask:     result = func(*args, **kwargs)
[2018-06-20 15:35:34,843] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 1477, in _run_raw_task
[2018-06-20 15:35:34,844] {base_task_runner.py:98} INFO - Subtask:     self.render_templates()
[2018-06-20 15:35:34,844] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 1760, in render_templates
[2018-06-20 15:35:34,845] {base_task_runner.py:98} INFO - Subtask:     rendered_content = rt(attr, content, jinja_context)
[2018-06-20 15:35:34,847] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/airflow/models.py", line 2481, in render_template
[2018-06-20 15:35:34,848] {base_task_runner.py:98} INFO - Subtask:     return jinja_env.get_template(content).render(**context)
[2018-06-20 15:35:34,849] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/jinja2/environment.py", line 812, in get_template
[2018-06-20 15:35:34,849] {base_task_runner.py:98} INFO - Subtask:     return self._load_template(name, self.make_globals(globals))
[2018-06-20 15:35:34,850] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/jinja2/environment.py", line 774, in _load_template
[2018-06-20 15:35:34,851] {base_task_runner.py:98} INFO - Subtask:     cache_key = self.loader.get_source(self, name)[1]
[2018-06-20 15:35:34,852] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/jinja2/loaders.py", line 171, in get_source
[2018-06-20 15:35:34,854] {base_task_runner.py:98} INFO - Subtask:     f = open_if_exists(filename)
[2018-06-20 15:35:34,855] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/jinja2/utils.py", line 151, in open_if_exists
[2018-06-20 15:35:34,856] {base_task_runner.py:98} INFO - Subtask:     return open(filename, mode)
[2018-06-20 15:35:34,856] {base_task_runner.py:98} INFO - Subtask: IOError: [Errno 20] Not a directory: '/home/airflow/gcs/dags/my_zip_file.zip/sql/query_file.sql'
pteehan
  • 807
  • 9
  • 19

3 Answers3

2

It looks like Airflow currently (as of 1.10 version) does not support loading templates from zipped DAGs since it uses only the jinja2.FileSystemLoader to load them (see DAG#get_template_env).

jcdenton
  • 361
  • 2
  • 10
-1

First, confirm the file structure in the ZIP is as expected.

Then, give this a try for getting the path:

os.path.join(os.path.dirname(os.path.realpath(__file__)),"sql","query_file.sql")

This is how we are getting the paths to our queries in our Airflow deployment.

Mike
  • 2,514
  • 2
  • 23
  • 37
-1

Its safer to use absolute path of the file, like below sql = os.path.abspath(os.path.join(os.path.dirname(__file__), "sql/query_file.sql"))

Because Airflow operator/task likely run your command/method under a newly created temporary directory which doesn't copy your dependencies. Check a Implementation example on Airflow Github, you will understand.

Fan
  • 410
  • 5
  • 10