7

My Cloud Composer-managed Airflow environment has been stuck for hours since I canceled a Task Instance that was taking too long (let's call it Task A).

I've cleared all the DAG Runs and task instances, but a few jobs are still running, plus one job in the Shutdown state, which I assume belongs to Task A (snapshot of my Jobs).

In addition, it seems that the scheduler is not running, since recently deleted DAGs keep appearing in the dashboard.

Is there a way to kill the jobs or reset the scheduler? Any ideas for un-sticking the Composer environment are welcome.

Ary Jazz

2 Answers

8

You can restart the scheduler as follows:

From your Cloud Shell:

1. Determine your environment’s Kubernetes cluster:

gcloud composer environments describe ENVIRONMENT_NAME \
    --location LOCATION 

2. Get credentials and connect to the Kubernetes cluster (substituting the cluster name and zone found in step 1):

gcloud container clusters get-credentials ${GKE_CLUSTER} --zone ${GKE_LOCATION}

3. Run the following command to restart the scheduler:

kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -

Steps 1 and 2 are detailed here. Step 3 basically replaces the “airflow-scheduler” deployment with itself, thus restarting the service.
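
Putting the three steps together, here is a minimal end-to-end sketch. It assumes the cluster’s full resource path is exposed in the describe output under config.gkeCluster (in the form projects/PROJECT/zones/ZONE/clusters/CLUSTER); ENVIRONMENT_NAME and LOCATION are placeholders for your own values:

# Look up the environment's GKE cluster resource path
# (assumed form: projects/PROJECT/zones/ZONE/clusters/CLUSTER).
GKE_RESOURCE=$(gcloud composer environments describe ENVIRONMENT_NAME \
    --location LOCATION --format="value(config.gkeCluster)")
GKE_LOCATION=$(echo "${GKE_RESOURCE}" | cut -d/ -f4)  # zone
GKE_CLUSTER=$(echo "${GKE_RESOURCE}" | cut -d/ -f6)   # cluster name

# Point kubectl at the cluster, then force-replace the scheduler deployment.
gcloud container clusters get-credentials "${GKE_CLUSTER}" --zone "${GKE_LOCATION}"
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -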

If restarting the scheduler doesn’t help, you may also need to recreate your Composer environment, and if this happens every time, troubleshoot your DAGs.

ch_mike
    Looks like you accidentally pasted the same snippet for step 2 as step 1. – Wilson Aug 21 '18 at 00:31
  • You can simply delete the `airflow-scheduler` Pod, which will cause Kubernetes to replace it with a new one (see the sketch after these comments). – skyler Jan 27 '19 at 14:20
  • What about restarting the Airflow web server? I tried killing/restarting the airflow-scheduler and even deleting the Pod, but the webserver continues to run and I need to restart it. – Leo Jan 28 '19 at 16:24
  • @Leo, you should be able to force a redeployment of the Airflow web server by [updating the PyPI packages](https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies), for example by installing a dummy dependency (sketched after these comments). Depending on your use case, deploying a [self-managed Airflow web server](https://cloud.google.com/composer/docs/how-to/managing/deploy-webserver) may be a good alternative. – ch_mike Jan 29 '19 at 17:55
  • @ch_mike - It doesn't look like there is deployment for airflow-scheduler or airflow-worker -- only airflow-monitoring and airflow-sqlproxy. Do you have another workaround to restart the scheduler? – ethanenglish Aug 21 '19 at 15:35
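
A sketch of the pod-deletion approach skyler suggests above; the pod name here is a placeholder, so look up the real one first:

# Find the scheduler pod; its name is assumed to start with airflow-scheduler-.
kubectl get pods
# Delete it; the Deployment's ReplicaSet immediately creates a replacement.
kubectl delete pod airflow-scheduler-XXXXXXXXXX-XXXXX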
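
And a sketch of the dummy-dependency trick ch_mike mentions for forcing a web server redeployment; the six>=1.11.0 pin is an arbitrary example, not a required choice:

# Updating the environment's PyPI packages triggers a redeployment,
# which includes the Airflow web server.
gcloud composer environments update ENVIRONMENT_NAME \
    --location LOCATION \
    --update-pypi-package "six>=1.11.0"
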
0

Which version of Composer are you running? It's a known issue that jobs may get stuck in beta versions. Composer 1.0.0 and 1.1.0 should not see any stuck jobs (except for tasks in SubDAGs, which is a known Airflow bug), so consider migrating to the latest Composer version.
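
For reference, you can check which version an environment is running with a describe call; this assumes the image version string is exposed under config.softwareConfig.imageVersion (e.g. composer-1.1.0-airflow-1.9.0):

# Print the environment's Composer/Airflow image version.
gcloud composer environments describe ENVIRONMENT_NAME \
    --location LOCATION --format="value(config.softwareConfig.imageVersion)"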

Feng Lu
  • We are actually using 1.1.0, set up on Monday this week, and we also see tasks that do not get scheduled/queued or do not change status at all. When this happens, it happens for all DAGs in our Composer. Restarting the scheduler as described above helps. Can you point to the bug with stuck tasks in SubDAGs? – mniehoff Aug 24 '18 at 09:39
  • SubDAGs are automatically marked as backfills (i.e. Airflow uses the backfill scheduler for all SubDAGs). Unfortunately, Airflow currently doesn't re-queue backfilled tasks after failures (see here: https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L255); more details here: https://issues.apache.org/jira/browse/AIRFLOW-1059. – Feng Lu Aug 25 '18 at 18:17