
I set up a test installation of Airflow a while ago with one test DAG, which is in the paused state.

Now, after this system ran for some weeks without actually doing much (besides some test runs), I wanted to dump the database and realized it is huge.

This is due to the dag_pickle table: apparently Airflow creates a pickle object and persists it in the database every second.

However, none of these pickle rows in the database are related to the DAG; the pickle_id field in the dag row is NULL:

mysql> select * from dag;
+------------------+-----------+-----------+-----------+----------------------------+--------------+----------------------------+----------------+-----------+---------------------------------------+---------+
| dag_id           | is_paused | is_subdag | is_active | last_scheduler_run         | last_pickled | last_expired               | scheduler_lock | pickle_id | fileloc                               | owners  |
+------------------+-----------+-----------+-----------+----------------------------+--------------+----------------------------+----------------+-----------+---------------------------------------+---------+
| test_dag         |         1 |         0 |         1 | 2017-05-10 11:13:57.987710 | NULL         | 2017-05-10 09:40:19.449429 |           NULL |      NULL | /opt/airflow/dags/test_dag.py         | airflow |
+------------------+-----------+-----------+-----------+----------------------------+--------------+----------------------------+----------------+-----------+---------------------------------------+---------+
1 row in set (0.00 sec)

Checking the dag_pickle table, we see that roughly one row per second is created:

mysql> select max(created_dttm), count(*), now() from dag_pickle;
+----------------------------+----------+---------------------+
| max(created_dttm)          | count(*) | now()               |
+----------------------------+----------+---------------------+
| 2017-05-10 11:20:26.000000 |  1893572 | 2017-05-10 11:20:26 | 
+----------------------------+----------+---------------------+
1 row in set (13.71 sec)

mysql> select max(created_dttm), count(*), now() from dag_pickle;
+----------------------------+----------+---------------------+
| max(created_dttm)          | count(*) | now()               |
+----------------------------+----------+---------------------+
| 2017-05-10 11:20:40.000000 |  1893586 | 2017-05-10 11:20:41 |
+----------------------------+----------+---------------------+
1 row in set (12.55 sec)

mysql> SELECT 1893586-1893572 as delta, TIMEDIFF('2017-05-10 11:20:41','2017-05-10 11:20:26') as time_delta;
+-------+------------+
| delta | time_delta |
+-------+------------+
|    14 | 00:00:15   |
+-------+------------+
1 row in set (0.00 sec)

(As you can see, this table is so big that it has become an operational burden; it is by now 7.5 GB in size.)
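
For reference, the size figure comes from the information_schema metadata (standard MySQL; the schema name airflow matches my setup):

mysql> SELECT round((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
    -> FROM information_schema.tables
    -> WHERE table_schema = 'airflow' AND table_name = 'dag_pickle';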

My questions now are:

  1. Are all these pickle rows used for anything, or can I safely delete them? (I assume the latter, because there is no reference from a dag row; see the sketch after this list.) Is there some cleanup procedure to remove unused data from the database?

  2. Is this an error in my configuration or in Airflow itself? Can I configure Airflow so that these pickle objects are only created when and if they are actually needed?

  3. What effect does the donot_pickle configuration option have, exactly? Can I still run the DAG from a worker on another machine with the CeleryExecutor if donot_pickle is set to True?
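
If deleting is safe (question 1), the cleanup I have in mind is roughly the following untested sketch; it assumes dag.pickle_id is the only reference into dag_pickle and that the table's primary key column is named id:

mysql> -- untested: remove every pickle that no dag row still points to
mysql> DELETE FROM dag_pickle
    -> WHERE id NOT IN (SELECT pickle_id FROM dag WHERE pickle_id IS NOT NULL);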

Thanks!

Alexander Köb
  • any chance you can share your airflow.cfg + any AIRFLOW envvars? I am looking into something related. – Nick Sep 15 '17 at 20:50
  • were you able to figure out the solution? – Gaurav Shah Jul 23 '18 at 15:57
  • 1
    No, I never figured that out and I do not work at this project any more. As far as I remember, I simply set donot_pickle to true and was done with it. I still could run DAGs on other machines (as for question 3). – Alexander Köb Jul 27 '18 at 20:17
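
For anyone landing here: the donot_pickle option mentioned in the last comment lives in the [core] section of airflow.cfg. A minimal sketch of that workaround (option name as shipped with Airflow 1.8):

# airflow.cfg
[core]
# Disable DAG pickling; per the comment above, this stopped the
# steady stream of rows into dag_pickle, and DAGs still ran on
# remote CeleryExecutor workers.
donot_pickle = True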

0 Answers