
I am tasked with recovering a VMware 6.5 cluster that, after an unexpected power failure, has a VM (the most important one...) stuck at boot.

From the vmware.log file, the problem seems to be related to a corrupted CTK file and, as I read in this VMware KB, it should be sufficient to remove the affected CTK file (not quite that simple, but simple enough...).

However, the affected VM has some active snapshots and, as I read in another (older) KB, this procedure should not be attempted if snapshots are present.

What is the right procedure to unstick the VM and let the boot process complete?

shodanshok

2 Answers


In this case, the solution was the simplest, yet strangest, possible: wait overnight. After some hours, both VMs became "unstuck" and booted correctly.

Regarding the change-tracking-file (CTK) question, I simulated the problem on a spare VMware hypervisor and, after reading VMware's own documentation (quite light on details...), I think the key point is that you can delete the CTK files even if the virtual machine has active snapshots, but doing so can corrupt any subsequent CTK-aware backups. So, in such cases, you also need to disable CTK at both the VM and disk level, consolidate any snapshots, do a full backup, re-enable CTK (again, at both the VM and disk level) and re-enable incremental backups.

Disabling CTK seems to affect only the last CTK file (note: a CTK file exists for each VMDK flat and delta file, so each snapshot creates a new CTK file), and this seems to be the reason VMware recommends having no snapshots when enabling/disabling changed block tracking. From here:

Note: Ensure that there are no snapshots on the virtual machine before enabling change tracking. If you create snapshots before enabling CBT, the QueryChangedDiskAreas API might not return any error or the data returned by QueryChangedDiskAreas might be incorrect.
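To see which disks and snapshot deltas currently have a CTK companion, something like the following Python sketch may help. It only inspects a list of file names; the naming convention it relies on (a descriptor `X.vmdk` paired with `X-ctk.vmdk`, and the `-flat`, `-delta` and `-sesparse` suffixes for data files) is an assumption based on what I saw in the VM directory, and `pair_ctk_files` is a hypothetical helper, not a VMware tool.

```python
def pair_ctk_files(filenames):
    """Map each descriptor .vmdk to its expected -ctk.vmdk companion.

    Assumption: descriptors are the .vmdk files that do NOT end in one of
    the data/tracking suffixes below. The value is None when no CTK file
    with the expected name is present in the listing.
    """
    names = set(filenames)
    non_descriptor = ("-ctk.vmdk", "-flat.vmdk", "-delta.vmdk", "-sesparse.vmdk")
    pairs = {}
    for name in sorted(names):
        if not name.endswith(".vmdk") or name.endswith(non_descriptor):
            continue
        ctk = name[: -len(".vmdk")] + "-ctk.vmdk"
        pairs[name] = ctk if ctk in names else None
    return pairs


if __name__ == "__main__":
    # Hypothetical directory listing: one base disk plus one snapshot.
    listing = [
        "vm.vmx", "vm.vmdk", "vm-flat.vmdk", "vm-ctk.vmdk",
        "vm-000001.vmdk", "vm-000001-delta.vmdk", "vm-000001-ctk.vmdk",
    ]
    for descriptor, ctk in pair_ctk_files(listing).items():
        print(descriptor, "->", ctk)
```

With one base disk and one snapshot, this lists two descriptors, each with its own CTK file, which matches the "one CTK file per flat/delta file" observation above.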

shodanshok

You can safely delete the -ctk.vmdk files, but you will also need to delete the CTK-related information in the .vmdk text descriptor files:

# Change Tracking File
changeTrackPath="YOUR-VM-ctk.vmdk"
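If there are many descriptors to clean, the edit can be scripted. The following is a minimal Python sketch, assuming the descriptor contains exactly the comment and `changeTrackPath` line shown above; `strip_change_tracking` is a hypothetical helper name, and it should be run on a copy of the descriptor first.

```python
import re


def strip_change_tracking(descriptor_text):
    """Return the descriptor text without the change-tracking entries.

    Drops the '# Change Tracking File' comment and any changeTrackPath
    line; every other line is kept unchanged.
    """
    kept = []
    for line in descriptor_text.splitlines():
        stripped = line.strip()
        if stripped.lower() == "# change tracking file":
            continue
        if re.match(r"changeTrackPath\s*=", stripped, re.IGNORECASE):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Hypothetical descriptor fragment for illustration only.
    sample = (
        "# Disk DescriptorFile\n"
        "version=1\n"
        "\n"
        "# Change Tracking File\n"
        'changeTrackPath="YOUR-VM-ctk.vmdk"\n'
    )
    print(strip_change_tracking(sample), end="")
```

The function works on text rather than on a path so that you can review the result before writing anything back to the datastore.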

Also remove the CTK-related entries in the .vmx file. There is one general entry and one per disk, as below.

ctkEnabled = "TRUE"
scsi0:0.ctkEnabled = "TRUE"
scsi0:1.ctkEnabled = "TRUE"

Then delete the snapshots if you wish and re-add the CTK entries above to the .vmx file. When you reboot, the -ctk.vmdk file will be generated again, reinitializing the sequence numbers. Your backup software should simply run a full sync and pick up the new CBT sequence numbers on subsequent runs.
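The remove-then-re-add step for the .vmx file can be sketched in Python as below. This is only an illustration under assumptions: the entries are taken to look like the `ctkEnabled` lines shown above (one general entry, plus per-disk entries with a `controllerN:M.` prefix), and both function names are hypothetical. Work on a copy of the .vmx file.

```python
import re

# Matches 'ctkEnabled = ...' and per-disk forms such as 'scsi0:1.ctkEnabled = ...'.
CTK_LINE = re.compile(r"^\s*(?:[a-z]+\d+:\d+\.)?ctkEnabled\s*=", re.IGNORECASE)


def split_ctk_entries(vmx_text):
    """Return (vmx_text_without_ctk, removed_ctk_lines)."""
    kept, removed = [], []
    for line in vmx_text.splitlines():
        (removed if CTK_LINE.match(line) else kept).append(line)
    return "\n".join(kept) + "\n", removed


def restore_ctk_entries(vmx_text, ctk_lines):
    """Append previously removed CTK lines back to the .vmx text."""
    body = vmx_text if vmx_text.endswith("\n") else vmx_text + "\n"
    if not ctk_lines:
        return body
    return body + "\n".join(ctk_lines) + "\n"


if __name__ == "__main__":
    # Hypothetical .vmx fragment for illustration only.
    vmx = (
        'displayName = "YOUR-VM"\n'
        'ctkEnabled = "TRUE"\n'
        'scsi0:0.ctkEnabled = "TRUE"\n'
        'scsi0:0.fileName = "YOUR-VM.vmdk"\n'
    )
    cleaned, saved = split_ctk_entries(vmx)
    print(cleaned, end="")                               # CTK entries removed
    print(restore_ctk_entries(cleaned, saved), end="")   # CTK entries re-added
```

Keeping the removed lines in a list means the exact original entries are restored later, rather than retyped by hand.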

This is a guide on how to troubleshoot ESXi snapshots.

The main concern when working with snapshots is busy services like databases or Active Directory. You need some bridge service so that VMware Tools can pause them in an orderly way, so that remaining data is flushed to disk before the snapshot is taken; alternatively, address the issue manually with VMware Tools scripts that handle the services for you.

A common misunderstanding among users is the belief that simply requesting a quiesced snapshot on a busy server is enough.

Daniel J.