
One of the tasks in my DAG sometimes hangs when accessing Cloud Storage. It seems the code stops at the download function here:

    from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

    hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
    for input_file in hook.list(bucket, prefix=folder):
        hook.download(bucket=bucket, object=input_file)

In my tests, the folder contains a single 20 MB JSON file.

The task normally takes 20-30 seconds, but in some cases it runs for 5 minutes, after which its state is updated to SCHEDULED and stays stuck there (I have waited for more than 6 hours). I suspect the 5 minutes come from the scheduler_zombie_task_threshold configuration setting (300 seconds), but I am not sure.

If I clear the task manually in the web UI, it is quickly queued and runs correctly. I am working around the issue by setting an execution_timeout, which correctly moves the task to the FAILED or UP_FOR_RETRY state when it runs longer than 10 minutes, but I'd like to fix the underlying issue rather than rely on a fixed timeout threshold. Any suggestions?
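For reference, this is roughly how I apply the workaround (a minimal sketch; the task id and the download_files callable are illustrative names, and the callable is assumed to wrap the hook code above):

    from datetime import timedelta

    from airflow.operators.python_operator import PythonOperator

    download_task = PythonOperator(
        task_id='download_input_files',
        python_callable=download_files,  # hypothetical callable wrapping the hook code above
        execution_timeout=timedelta(minutes=10),  # task is failed/retried if it runs longer than 10 minutes
        retries=1,
        dag=dag,
    )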

TAPeri

1 Answer


There was a discussion on the Cloud Composer Discuss group about this: https://groups.google.com/d/msg/cloud-composer-discuss/alnKzMjEj8Q/0lbp3bTlAgAJ. It is a problem with the Celery executor when Airflow workers die.

Although Composer is working on a fix, if you want this to happen less frequently in the current version, you might consider lowering the parallelism setting in your Airflow configuration or creating a new environment with a larger machine type.
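For example, a minimal sketch of the kind of override meant here (values are illustrative; in Composer you would apply them as Airflow configuration overrides on the environment rather than editing airflow.cfg directly):

    [core]
    parallelism = 16        # total task slots across all workers (Airflow default is 32)
    dag_concurrency = 8     # concurrent task instances per DAG (Airflow default is 16)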

Trevor Edwards