One of the tasks in my DAG sometimes hangs when accessing Cloud Storage. It seems the code stops at the download
function here:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
for input_file in hook.list(bucket, prefix=folder):
    hook.download(bucket=bucket, object=input_file)
In my tests, the folder contains a single 20 MB JSON file.
The task normally takes 20-30 seconds, but in some cases it runs for 5 minutes, after which its state is updated to SCHEDULED and it stays stuck there (I have waited more than 6 hours). I suspect the 5 minutes come from the scheduler_zombie_task_threshold setting (300 seconds), but I'm not sure.
If I clear the task manually in the web UI, it is quickly queued and runs again correctly. I am working around the issue by setting an execution_timeout, which correctly moves the task to the FAILED or UP_FOR_RETRY state when it takes longer than 10 minutes. However, I'd like to fix the underlying issue rather than rely on a fixed timeout threshold. Any suggestions?
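For reference, here is a minimal sketch of how the timeout workaround can be configured, assuming it is set through a default_args dictionary passed to the DAG. Only the 10-minute execution_timeout comes from the description above; the retries and retry_delay values are illustrative assumptions.

```python
from datetime import timedelta

# Task-level settings, typically passed as DAG(..., default_args=default_args)
# or directly as keyword arguments to the operator.
default_args = {
    # Mark the task as failed if a single try runs longer than 10 minutes.
    "execution_timeout": timedelta(minutes=10),
    # Assumed values: with retries > 0 a timed-out try goes to UP_FOR_RETRY
    # first, and to FAILED only once the retries are exhausted.
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}
```

With retries set to 0, a timed-out task would go straight to FAILED.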