Azure Databricks: Unexpected failure while waiting for the cluster to be ready. Cause Cluster is unusable since the driver is unhealthy

Question

I have some scheduled data pipelines that are orchestrated via Azure Data Factory, each with a Databricks activity that runs on a job cluster.

All my Databricks activities are stuck in retry loops and failing with the following error:

Databricks execution failed with error state: InternalError, error message: Unexpected failure while waiting for the cluster <cluster-id> to be ready.Cause Cluster <cluster-id> is unusable since the driver is unhealthy.

My Databricks cluster is not even starting up.

This issue is quite similar to the one posted here:
AWS Databricks cluster start failure

However, there are a few differences:

- My pipelines are running on Azure: Azure Data Factory and Azure Databricks.
- I can spin up my interactive clusters (in the same workspace) without any problem.
- I have checked with my colleagues who are running similar pipelines on different subscriptions (in the same region), and they are not facing any issue.

Any idea what is going on here? Is it just a service interruption of some sort, or is there something I can do to resolve this?

Is the problem only with one interactive cluster? Have you tried to create another cluster and use it instead? — Saideep Arikontham, Nov 18 '22 at 04:18
No, the Databricks activities that I have spin up a job cluster. — Minura Punchihewa, Nov 18 '22 at 04:23
Is it possible for you to create a cluster, select existing interactive cluster instead of new job cluster and try? — Saideep Arikontham, Nov 18 '22 at 05:36

Minura Punchihewa · Accepted Answer · 2022-11-20T06:52:16.187

It turns out that my pipelines were failing because the init script configured for our clusters was not executing correctly.

We have an in-house Python package that we maintain in Azure Artifacts, and installing it requires a DevOps token. The init script contains the command that installs the package on our clusters, and because the token had expired, the init script was failing.

As a result, the cluster could not start up properly. The error message is quite cryptic though. "Cause Cluster is unusable since the driver is unhealthy" could literally mean anything.

So if you come across this yourself, check your init script.
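One way to rule out an expired token without waiting for a cluster to fail is to check it against the package feed separately. The following is only a rough sketch, not what we actually run: the function name, the feed URL placeholder, and the assumption that the feed accepts Basic auth with an empty user name and the token as the password are all mine, so adjust them to your setup.

```python
import base64
import urllib.error
import urllib.request


def token_is_accepted(feed_url, token, opener=urllib.request.urlopen):
    """Return True if the feed answers with HTTP 200, False if the token is rejected.

    feed_url is hypothetical here, e.g. the simple-index URL of your feed.
    """
    # Assumption: Basic auth with an empty user name and the token as password.
    credentials = base64.b64encode(f":{token}".encode()).decode()
    request = urllib.request.Request(
        feed_url, headers={"Authorization": f"Basic {credentials}"}
    )
    try:
        with opener(request, timeout=10) as response:
            # Anything other than a plain 200 is treated as "not accepted".
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            return False
        # Other HTTP errors are not about the token, so let them surface.
        raise
```

Running something like this from a scheduled job or a notebook before the pipelines start would flag an expired token early. The `opener` parameter is only there so the function can be exercised without network access.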

Note: Another hint here was that when we looked through the Event log, we noticed that the time between the events INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED was very long, much longer than the init script should actually take.
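If you have the cluster events available as data rather than in the UI, the gap between those two events can be measured with a few lines of Python. This is a minimal sketch under an assumption of mine: each event is a dict with a `type` string and a `timestamp` in epoch milliseconds. Check that against whatever export you actually have.

```python
def init_script_durations(events):
    """Return the seconds between each INIT_SCRIPTS_STARTED and the
    following INIT_SCRIPTS_FINISHED event.

    Assumed event shape: {"type": str, "timestamp": epoch milliseconds}.
    """
    durations = []
    started_at = None
    # Sort by time so the input order does not matter.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["type"] == "INIT_SCRIPTS_STARTED":
            started_at = event["timestamp"]
        elif event["type"] == "INIT_SCRIPTS_FINISHED" and started_at is not None:
            durations.append((event["timestamp"] - started_at) / 1000)
            started_at = None
    return durations
```

Comparing these durations against a run from when the pipelines were healthy makes an unusually slow init script easy to spot.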
