Cannot kill VMware VM

Question

I am encountering a very stubborn VM (2008R2, VMware Tools slightly outdated, the version that came with 5.5U3a) on an ESXi 6.0U2 cluster running on Dell R630 servers. From the outside, the VM becomes unresponsive after some time - might be a day, might be a week - and no longer responds to pings, connection requests and so on (it runs an industrial application and some MSSQL). That behaviour could already be observed when the cluster ran 5.5U3a, though.

So, I try to restart the VM via the webclient or via the fat client. Nothing happens. Like, for hours. Next escalation step:

esxcli vm process kill -w <worldID> -t soft

No response, no change. Skip -t hard and directly go to

esxcli vm process kill -w <worldID> -t force

No response as well. The VM keeps chugging along being unresponsive and all, but the world simply refuses to be killed. There's no error message, either. Rebooting the host with the VM is the last resort.
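For reference, the escalation described above could be scripted roughly as follows. This is only a sketch: `world_id_for`, `world_alive`, `escalate_kill` and `WAIT_SECS` are made-up names, the parsing assumes the usual "Key: Value" layout of `esxcli vm process list`, and unlike what I did by hand it also tries `-t hard` before `-t force`.

```shell
#!/bin/sh
# Sketch only: walk through soft -> hard -> force for one VM world.
# Helper names and WAIT_SECS are illustrative, not part of esxcli.

WAIT_SECS="${WAIT_SECS:-30}"   # pause after each attempt (assumed value)

# Print the World ID of the VM whose display name is $1.
world_id_for() {
    esxcli vm process list | awk -v name="$1" '
        /^ *World ID:/     { wid = $NF }
        /^ *Display Name:/ {
            sub(/^ *Display Name: */, "")
            if ($0 == name) print wid
        }'
}

# Succeeds while world $1 still shows up in the process list.
world_alive() {
    esxcli vm process list | grep -q "World ID: $1\$"
}

# Try each kill type in turn; return 1 if the world survives all of them.
escalate_kill() {
    wid="$1"
    for kill_type in soft hard force; do
        world_alive "$wid" || return 0
        echo "trying -t $kill_type on world $wid"
        esxcli vm process kill -w "$wid" -t "$kill_type"
        sleep "$WAIT_SECS"
    done
    if world_alive "$wid"; then
        echo "world $wid survived all kill types"
        return 1
    fi
    return 0
}
```

Usage would be something like `escalate_kill "$(world_id_for myvm)"`, where `myvm` stands in for the VM's display name. In my case this would end with the "survived all kill types" message.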

How can I identify the root cause for this very strange behaviour?

What about the patch level of the OS and SQL Server? Maybe there is an MS KB article covering a similar issue that has not been applied yet? – Paolo, Aug 11 '16 at 10:54

score 4 · Answer 1 · answered Aug 11 '16 at 08:04

After identifying the right process with ps | grep vmx, you can forcibly terminate it with kill -9 <pid>.

Be very careful to select (and kill) the right process. For more information, take a look here.

If nothing works, then according to VMware's own documentation you have to reboot the ESXi host.
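To lower the risk of picking the wrong process, one option is to resolve the ID from the VM's display name and print the kill command instead of running it. A minimal sketch, assuming the "VMX Cartel ID" reported by `esxcli vm process list` is the same ID that `ps | grep vmx` shows for that VM (check this against the ps output before running anything); `vmx_cartel_for` and `plan_kill` are hypothetical helper names:

```shell
#!/bin/sh
# Sketch only: find the vmx process ID for ONE named VM and print,
# rather than execute, the kill command. Helper names are illustrative.

# Print the VMX Cartel ID of the VM whose display name is $1.
vmx_cartel_for() {
    esxcli vm process list | awk -v name="$1" '
        /^ *VMX Cartel ID:/ { cid = $NF }
        /^ *Display Name:/  {
            sub(/^ *Display Name: */, "")
            if ($0 == name) print cid
        }'
}

# Dry run: show the command so it can be compared with `ps | grep vmx`.
plan_kill() {
    cid="$(vmx_cartel_for "$1")"
    if [ -z "$cid" ]; then
        echo "no running VM named '$1'" >&2
        return 1
    fi
    echo "kill -9 $cid"
}
```

Calling `plan_kill myvm` (with `myvm` as a placeholder display name) prints the command for you to review; nothing is killed until you run it yourself.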

shodanshok

That's what I already did. I'm just wondering what might be the cause? – mexell Aug 12 '16 at 06:37
It's difficult to tell. Maybe the VM process is stuck in an uninterruptible sleep (waiting on something kernel-side), in which case even `kill` can't do anything. What is telling is that even VMware suggests rebooting the ESXi host... – shodanshok Aug 12 '16 at 08:41

score 3 · Answer 2 · answered Aug 11 '16 at 07:15

How can I identify the root cause for this very strange behaviour?

The scientific method is your friend.

1. Define the problem you want to solve. It looks like you have 2 (possibly interrelated) issues: the VM becomes unresponsive, and ESXi can't kill it.
2. Gather data. Look in the logs, your monitoring etc. for relevant information.
3. Analyse the data.
4. Make changes based on your analysis.
5. Verify the changes work. If they don't, go back to 2 or 3 and gather more data/reanalyse.
6. Document your findings.
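For the data-gathering step, it helps to grab the relevant logs while the VM is still hung and before the host gets rebooted. A rough sketch, assuming the default ESXi log locations (/var/log/vmkernel.log and /var/log/hostd.log) and the vmware.log that sits next to the VM's .vmx file; `snapshot_logs`, `LOGS` and the 200-line cutoff are arbitrary choices for illustration:

```shell
#!/bin/sh
# Sketch only: copy the tail of the logs most likely to mention the hang
# into one file. Function name, LOGS and the line count are illustrative.

LOGS="${LOGS:-/var/log/vmkernel.log /var/log/hostd.log}"

snapshot_logs() {
    vm_dir="$1"   # directory holding the VM's .vmx and vmware.log
    out="$2"      # file to write the snapshot to
    : > "$out"    # start with an empty output file
    # $LOGS is left unquoted on purpose so it splits into separate paths.
    for f in "$vm_dir/vmware.log" $LOGS; do
        [ -r "$f" ] || continue        # skip logs that are not readable
        echo "=== $f ===" >> "$out"
        tail -n 200 "$f" >> "$out"
    done
}
```

You would call it as `snapshot_logs <vm directory> <output file>`, ideally writing to a datastore that survives the reboot, and then compare the timestamps across the three logs during the analysis step.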
