
I've been using Google Composer for a while (composer-0.5.2-airflow-1.9.0) and have had recurring problems with the Airflow scheduler. The scheduler container sometimes crashes, and it can end up in a locked state in which it cannot start any new tasks (a database-connection error), forcing me to re-create the whole Composer environment. This time the pod is in a CrashLoopBackOff and the scheduler cannot restart at all. The error is very similar to ones I've seen before. Here's the traceback from Stackdriver:

Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 27, in <module>
    args.func(args)
  File "/usr/local/lib/python2.7/site-packages/airflow/bin/cli.py", line 826, in scheduler
    job.run()
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 198, in run
    self._execute()
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 1549, in _execute
    self._execute_helper(processor_manager)
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 1594, in _execute_helper
    self.reset_state_for_orphaned_tasks(session=session)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 266, in reset_state_for_orphaned_tasks
    .filter(or_(*filter_for_tis), TI.state.in_(resettable_states))
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2783, in all
    return list(self)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2935, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2958, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 948, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
    context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
    exc_info
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
    context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 508, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 250, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 50, in defaulterrorhandler
    raise errorvalue
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: u'SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date AS task_instance_execution_date, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_instance.queued_dttm AS task_instance_queued_dttm, task_instance.pid AS task_instance_pid \nFROM task_instance \nWHERE (task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s) AND task_instance.state IN (%s, %s) FOR UPDATE'] [parameters: ('pb_write_event_tables_v2_dev2', 'check_table_chest_progressed', datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_name_changed', datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_registered', 
datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_unit_leveled_up', datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_virtual_currency_earned', datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_virtual_currency_spent', datetime.datetime(2018, 6, 26, 8, 0), u'scheduled', u'queued')] (Background on this error at: http://sqlalche.me/e/e3q8)
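For anyone hitting similar 1205 errors in their own DAG or hook code (the managed scheduler itself can't be patched in Composer), the usual workaround is to retry the transaction, as the MySQL message itself suggests. A minimal, dependency-free sketch; the function name and retry policy are my own, not Airflow's:

```python
import time

# MySQL error code from the traceback above.
LOCK_WAIT_TIMEOUT_CODE = "1205"

def run_with_retry(fn, retries=3, delay=1.0):
    """Call `fn`, retrying when MySQL reports 'Lock wait timeout
    exceeded' (error 1205).

    `fn` is any callable; in real DAG code it would wrap the
    SQLAlchemy query that hits the timeout.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            # In practice you would catch sqlalchemy.exc.OperationalError
            # specifically; matching on the message keeps this sketch
            # free of dependencies.
            if LOCK_WAIT_TIMEOUT_CODE not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # simple linear backoff
```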

I'm out of my depth with technical RDBMS errors. However, this is an out-of-the-box Google Composer environment with the default configuration, so I wonder if anyone else has had a similar problem or has some idea what's going on. I understand that Composer uses Google Cloud SQL for the metadata database, apparently with a MySQL backend.

The Airflow Scheduler image is gcr.io/cloud-airflow-releaser/airflow-worker-scheduler-1.9.0:cloud_composer_service_2018-06-19-RC3.

I have to add that I didn't encounter this scheduler problem with a self-made Airflow Kubernetes setup, but then I was using a bleeding-edge Airflow version with PostgreSQL.

Dalar
    Another (but maybe not related) connection error that I sometimes get with tasks is `sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2005, "Unknown MySQL server host 'airflow-sqlproxy-service' (110)") (Background on this error at: http://sqlalche.me/e/e3q8)` – Dalar Jun 28 '18 at 12:24
  • Yet another (but maybe not related) error arises from the Redis connection when there are tasks that stay indefinitely in the "queued" state. The tasks won't restart even when their state is cleared and the DAG is "running". When trying to start them manually from a cleared state, I get this error: `OperationalError: Error -2 connecting to airflow-redis-service:6379. Name or service not known.` – Dalar Jun 29 '18 at 11:15
  • We have the exact same error; the Kubernetes workload keeps crashing – MarkeD Jul 02 '18 at 14:51
  • Side question - Where do you see these logs? – Maxim Veksler Jul 05 '18 at 07:32
  • In the Google Cloud console general logs, you can select the composer environment and/or the kubernetes cluster/workloads it is running upon. – MarkeD Jul 06 '18 at 09:45
  • For an update: The Composer seems to be much more stable now, at least after the August update. Last week I had to re-initialize the whole cluster, and DAG runs seem to be more stable and robust now. – Dalar Aug 31 '18 at 08:53
  • What do you mean by "re-initialize the whole cluster"? Is there any easy way to restart the whole thing without removing and creating a new cluster? – Leo Jan 28 '19 at 16:04
  • I don't think there's currently a way to just restart the whole cluster in Composer (you shouldn't need to do that with Kubernetes). I wrote myself a setup script that uses `gcloud` commands. – Dalar Feb 26 '19 at 09:20
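The re-create flow mentioned in the comments can be scripted with `gcloud`. A hedged sketch, not the commenter's actual script; the environment name, location, and sizing here are placeholders:

```shell
# Placeholder names: my-env / us-central1; adjust to your project.
gcloud composer environments delete my-env \
    --location us-central1 --quiet

gcloud composer environments create my-env \
    --location us-central1 \
    --node-count 3 \
    --machine-type n1-standard-2
```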

1 Answer


This can be caused by overwhelming the environment's resources, in particular the Cloud SQL database backing Airflow.

To prevent this, you can enable asynchronous DAG loading or have the environment use higher machine types.

Furthermore, I recommend upgrading to the latest version, composer-1.10.6-airflow-1.10.6, as many of these issues have been fixed in newer releases.
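When creating a new environment, both recommendations above can be applied at creation time. A sketch assuming the current `gcloud composer` flags; the environment name and location are placeholders:

```shell
# Pin the recommended image version and use a larger machine type.
gcloud composer environments create my-env \
    --location us-central1 \
    --image-version composer-1.10.6-airflow-1.10.6 \
    --machine-type n1-standard-4
```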

Nathan Nasser