7

I'm using Azure Databricks with a custom configuration that uses VNet injection, and I am unable to start a cluster in my workspace. The error message is not documented anywhere in the Microsoft or Databricks documentation, so I am unable to diagnose why my cluster is not starting. I have reproduced the error message below:

Instance ID: [redacted]

Azure error message: 
Instance bootstrap failed.
Failure message: Cloud Provider Failure. Azure VM Extension stuck on transitioning state. Please try again later.
VM extension code: ProvisioningState/transitioning
instanceId: InstanceId([redacted])
workerEnv: workerenv-6662162805421143
Additional details (may be truncated): Enable in progress

Although it says "Please try again later", I have been retrying all day and getting the same message, which leads me to think the error message is not descriptive and something else is really going on.

Does anyone have ideas on what the issue could be?

Abhishek Sharma
  • I'm seeing this error too - in the West Europe region. Tried it twice throughout today, given up for now. Not encouraging that you were seeing this 7 days ago; did you get it sorted? – nmca70 Apr 16 '21 at 20:13
  • 1
    Yes, though I wasn't able to solve the original issue. I believe it has something to do with cluster connectivity failure. I tried making a custom route-table as given here: [User-defined route settings for Azure Databricks](https://docs.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/udr#:~:text=Table%201%20%20%20%20Source%20%20,%20Metastore%20IP%20%203%20more%20rows%20) but that didn't fix the issue. We ended up just switching from vnet injection to vnet peering and were able to start the cluster. Hopefully that works with your network arch @nmca70 – Abhishek Sharma Apr 18 '21 at 14:53

4 Answers

3

This seems to be an issue with connectivity from the Databricks instance to the central Databricks servers. Our VNet injection settings were apparently not sufficient to route requests to the right place. Ultimately the problem was fixed by switching the Databricks instance from VNet injection to VNet peering (with its own custom VNet). This way the Databricks instance was able to communicate with our resource in another VNet while still being able to start the cluster.

This fulfilled our project requirements, but there may be cases where it's not sufficient for what a project requires. Hopefully the Azure Databricks team at least documents this issue to create less confusion in the future.

I also tried creating custom user-defined routes for Databricks, but that did not fix the issue.

Abhishek Sharma
0

Cloud Provider Failure. Azure VM Extension stuck on transitioning state. Please try again later.

This is a cloud provider (Azure) issue. On Azure, Databricks uses the Azure VM extension service to perform bootstrap steps. This error means the Azure extension service could not finish the extension operation and report the result back to Databricks.

This is a well-known Azure extension issue, but it's transient. Retrying the cluster start should fix it.
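If the failure really is transient, the retry can be automated rather than clicked by hand. A minimal backoff sketch (the `flaky_start` stand-in below is hypothetical; in practice `fn` would wrap whatever starts your cluster, e.g. a call to the Databricks Clusters API):

```python
import time

def retry(fn, attempts=5, base_delay=1.0, transient=(RuntimeError,)):
    """Call fn(), retrying with exponential backoff on transient errors."""
    for attempt in range(attempts):
        try:
            return fn()
        except transient:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Stand-in for a cluster-start call that fails twice, then succeeds:
calls = {"n": 0}
def flaky_start():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Cloud Provider Failure ... try again later")
    return "RUNNING"

print(retry(flaky_start, base_delay=0.01))  # -> RUNNING
```

Of course, if the error persists all day (as in the question), retrying won't help and the cause is likely in the network configuration.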

0

I am also running into this error. However, I cannot simply retry starting the machine: the error is passed back to the Terraform agent, and after the `terraform apply` failure there is no compute cluster left in the Databricks workspace to restart.

This is a bit annoying, and I don't think VNet peering solves my issue, since I need a static IP for the Databricks clusters.

From what I understand, that requires VNet injection plus a NAT gateway associated with the container subnet.

Not sure what the correct format is for this request; it's not a new question and I don't want it to disappear in the comments. Could someone at Databricks/Azure please fix this? It's quite annoying that the Databricks documentation is wrong and all the big tutorials on VNet injection are not working.

A static IP for the clusters should really be a default in any enterprise environment ... not to speak of external storage in private subnets...

0

I was getting the same error, but for a different reason. Sometimes, usually at 7 am, my clusters were failing to start with the same error. A new start attempt at 8 am would work.

During one of these failures I noticed that the IP address of the Artifact Blob storage primary endpoint was different from what I had configured in the UDR for my region. At that moment I realized that the IP addresses of those services on that page are dynamic. The problem is that you cannot put hostnames in a UDR in Azure :( So this is exactly what I did:

For every service of your workspace region from here, do the following (I will use dbartifactsprodnortheu.blob.core.windows.net as an example):

  1. Ping dbartifactsprodnortheu.blob.core.windows.net and note the IP address (20.150.84.36 in this example).
  2. Go to https://iplocation.io/ip-whois-lookup/, enter the IP address from Step 1, and hit Check Now. Note all of the CIDR results.
  3. For each CIDR result, add a new record in your route table. Choose a name for the Route name field, such as Artifact-Blob-storage-primary-01, set the address prefix to the first CIDR from Step 2, and set Next hop to Internet.
  4. Repeat Step 3, naming it Artifact-Blob-storage-primary-02 and using the second CIDR, 20.150.0.0/15. Depending on your region you can have several CIDRs.
  5. Repeat all the steps above for Artifact Blob storage secondary.

Now any dynamic IP assigned to those hosts should route to the internet as expected.
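The check behind Steps 1–3 can also be scripted, which helps when diagnosing a recurrence: resolve the hostname, then test whether the IP falls inside the CIDRs already in your UDR. A small offline sketch using Python's standard `ipaddress` module (the prefixes below are illustrative, not authoritative; hostname resolution from Step 1 is left out):

```python
import ipaddress

def ip_covered_by_routes(ip, route_prefixes):
    """Return True if `ip` falls inside any CIDR already present in the UDR.

    This detects the failure mode described above: the service hostname
    has resolved to a new IP outside the configured prefixes.
    """
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(p) for p in route_prefixes)

# The resolved artifact-storage IP from Step 1 against example UDR prefixes:
udr = ["20.150.84.0/30", "20.150.0.0/15"]
print(ip_covered_by_routes("20.150.84.36", udr))  # True: already covered
print(ip_covered_by_routes("52.239.137.4", udr))  # False: needs a new route
```

If the function returns False for a freshly resolved IP, that host has moved outside your configured ranges and the route table needs updating.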

Daniel Bonetti