1

My RHEL 6.1 virtual machines running on ESXi hosts 5.1.0 get frozen when the backup procedure tries to get a quiesced snapshot. I have to reset the virtual machines from VMware console. The tool used to quiesce the virtual machine is SYMCquiesce 1.0.0 by Symantec. The most weird is it does not fail always -- the backup is launched every week... but it fails only every two weeks.

After reading the VMware document http://kb.vmware.com/kb/2038606 and the RedHat document https://access.redhat.com/solutions/484303 and as I was not allowed to upgrade to RHEL 6.4, I decided to downgrade the VMware Tools to ESX 5.0. According to this table, I uninstalled VMware tools 9.0.15 and installed back 8.6.11.26309 (build-1310128). But this old version did not solve the problem... the virtual machines get on frozen... every two weeks.

The /etc/vmware-tools/tools.conf contains

[vmbackup]
enableSyncDriver = false

[logging]
log = true
vmtoolsd.level = warning
vmtoolsd.handler = vmx
vmss.level = warning
vmss.handler = vmx

The logging section is included according to this VMware document to get some log messages in the ESXi host.

According to Symantec README file, the script /usr/sbin/pre-freeze-script contains

/opt/SYMCquiesce/bin/freeze

The vmware.log file, located in the ESXi host datastore, contains the following lines when the snapshot is succesfully created:

2016-01-11T23:39:27.849Z| vmx| I120: Vix: [12153447 vigorCommands.c:481]: VigorSnapshotManager_Take: takeOptions=136, clientFlags=0, displayName=NBU_SNAPSHOT backupcentral 1452555565    
2016-01-11T23:39:27.879Z| vmx| I120: SnapshotVMX_TakeSnapshot start: 'NBU_SNAPSHOT backupcentral 1452555565', deviceState=0, logging=0, quiesced=1, forceNative=0, tryNative=1, sibling=0 saveAllocMaps=0 cb=18CA48E0, cbData=19CADF40
2016-01-11T23:39:27.893Z| vmx| I120: DISKLIB-VMFS  : "/vmfs/volumes/5314ce6f-2d501c33-de3c-a41f72000c75/myrhel61/solr-flat.vmdk" : open successful (65557) size = 188978561024, hd = 0. Type 3
2016-01-11T23:39:27.893Z| vmx| I120: DISKLIB-VMFS  : "/vmfs/volumes/5314ce6f-2d501c33-de3c-a41f72000c75/myrhel61/solr-flat.vmdk" : closed.
2016-01-11T23:39:27.927Z| vcpu-2| I120: ToolsBackup: changing quiesce state: IDLE -> STARTED
2016-01-11T23:39:29.692Z| vcpu-2| I120: ToolsBackup: changing quiesce state: STARTED -> COMMITTED
2016-01-11T23:39:29.693Z| vcpu-0| I120: Destroying virtual dev for scsi0:0 vscsi=10159
2016-01-11T23:39:29.693Z| vcpu-0| I120: VMMon_VSCSIStopVports: No such target on adapter
2016-01-11T23:39:29.742Z| vcpu-0| I120: SnapshotVMXTakeSnapshotCB: Enter checkpoint status 1, mode 0.

But when it fails, it appears the VMware tools does not respond -- pay attention on the lines after IDLE -> STARTED:

2016-01-18T23:03:40.412Z| vmx| I120: Vix: [12153447 vigorCommands.c:481]: VigorSnapshotManager_Take: takeOptions=136, clientFlags=0, displayName=NBU_SNAPSHOT backupcentral 1453158218
2016-01-18T23:03:40.432Z| vmx| I120: SnapshotVMX_TakeSnapshot start: 'NBU_SNAPSHOT backupcentral 1453158218', deviceState=0, logging=0, quiesced=1, forceNative=0, tryNative=1, sibling=0 saveAllocMaps=0 cb=18CA48E0, cbData=19B04E90
2016-01-18T23:03:40.519Z| vmx| I120: DISKLIB-VMFS  : "/vmfs/volumes/5314ce6f-2d501c33-de3c-a41f72000c75/myrhel61/solr-flat.vmdk" : open successful (65557) size = 188978561024, hd = 0. Type 3
2016-01-18T23:03:40.519Z| vmx| I120: DISKLIB-VMFS  : "/vmfs/volumes/5314ce6f-2d501c33-de3c-a41f72000c75/myrhel61/solr-flat.vmdk" : closed.
2016-01-18T23:03:40.522Z| vcpu-0| I120: ToolsBackup: changing quiesce state: IDLE -> STARTED
2016-01-18T23:03:58.715Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-01-18T23:04:03.814Z| vmx| I120: Tools: Tools heartbeat timeout.
2016-01-18T23:04:13.716Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.
2016-01-18T23:04:13.716Z| vmx| I120: GuestRpc: app toolbox's second ping timeout; assuming app is down
2016-01-18T23:04:13.717Z| vmx| I120: ToolsBackup: changing quiesce state: STARTED -> DONE
2016-01-18T23:04:13.717Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: done with snapshot 'NBU_SNAPSHOT backupcentral 1453158218': 0
2016-01-18T23:04:13.717Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Snapshot 0 failed: Failed to quiesce the virtual machine (40).
2016-01-18T23:04:13.718Z| vmx| I120: GuestRpc: Reinitializing Channel 0(toolbox)
2016-01-18T23:04:13.718Z| vmx| I120: GuestMsg: Channel 0, Cannot unpost because the previous post is already completed
2016-01-18T23:04:13.718Z| vmx| I120: GuestRpc: Channel 0 reinitialized.
2016-01-18T23:04:13.718Z| vmx| I120: GuestRpc: Channel 0 reinitialized.

Does anyone have any idea to solve this problem?

Thanks in advance.

PD: some other RHEL5 virtual machine gets also frozen.

update 1

The log files of the SYMCquiesce software do not exist on the dates when a "freezing" backup session ocurred. On the days when the backup session worked successfully, the log files contain the following message:

Unable to allocate shared memory

Here you see a calendar of the last backup sessions:

    date       reboot   log file
------------   ------   --------
Feb 22, 2016    no         yes
Feb 29, 2016    yes        no
Mar  7, 2016    no         yes
Mar 14, 2016    no         no
Mar 21, 2016    yes        no
Mar 28, 2016    no         yes
Apr  4, 2016    yes        no

On March 14th everything went fine because no backup session was launched.

ultimate update

After migrating to VMware 5.5 version, this problem disappeared.

Jdamian
  • 285
  • 3
  • 19
  • What does Symantec support say? – ewwhite Feb 16 '16 at 12:36
  • Are (or were) the VMtools running when the snapshot was taken? The log entry "Tools: Tools heartbeat timeout" might mean they are not (or had recently died). – dan_linder Feb 16 '16 at 23:00
  • @ewwhite, Symantec says the SYMCquiesce software is properly configured. – Jdamian Feb 18 '16 at 14:58
  • @dan_linder, I guess they might be. I also suspected the VMware tools were dead, then I added a line in the `crontab` to restart VMware tools just a couple of hours before the backup job was started... but that did not work. – Jdamian Feb 18 '16 at 15:02
  • What is the contents of the `/opt/SYMCquiesce/logs/*.log` file(s) when the quiesce fails? – dan_linder Apr 07 '16 at 19:05
  • @dan_linder, I replied in the main text. – Jdamian Apr 08 '16 at 12:16
  • You have a very clear need to bring your systems up to date. You have evidence that this will fix the problem. And *you're missing five years of critical and less critical updates*. Put more work into getting this "allowed". – Michael Hampton Apr 08 '16 at 16:48

0 Answers0