I have a Dell PowerEdge R630 with 4 drives in a RAID array. I'm not sure whether it's RAID 10 or RAID 5, because I didn't order or set up the server originally; I'm the network admin by default, and it's not my primary job. The server is running vSphere Essentials ESXi 6.7 and hosts half a dozen VMs.

I use Altaro VM Backup, running in a VM on another host, to back up this host as well as an ESXi 6.5 host. When I started backing up the VMs on this host, I found that the backups would fail at random: on any given night, 2 or 3 of the 5 VMs I'm backing up would fail, but not the same VMs each night. A couple of weeks ago they started failing every time.

While helping me find out why the backups were failing, Altaro support found this in the Altaro logs:

2019/09/24 00:11:31.034: DISKLIB-LINK : "san://snapshot-155[Storage] VMName/VMName.vmdk@192.168.1.1:443?User@domain.local/XXX" : failed to open (Unknown error). 
2019/09/24 00:11:31.034: DISKLIB-CHAIN : "san://snapshot-155[Storage] VMName/VMName.vmdk@192.168.1.1:443?User@domain.local/XXX" : failed to open (Unknown error). 
2019/09/24 00:13:18.446: VixDiskLib: Detected DiskLib error 2338 (NBD_ERR_NETWORK_CONNECT). 
2019/09/24 00:13:18.446: VixDiskLib: VixDiskLib_Read: Read 437 sectors at 19619760 failed. Error 14009 (The server refused connection) (DiskLib error 2338: NBD_ERR_NETWORK_CONNECT) at 5235.

Their support says these log entries, I assume the last line in particular, came directly from the host.

Not being an ESXi expert, I'm not sure which ESXi log files to look at to figure out what is going wrong, to confirm it's a drive problem on the host, and to determine which drive it is so I can replace it. So far vCenter is not raising any alerts or warnings about a drive problem, and the host is not indicating a problem with the array.

Another data point: most of these VMs are running Windows. Each of those runs Windows backup internally to a separate drive, and those backups all complete with no errors. I find it interesting that Windows is able to back up its drives from inside the VM, but there is a read error when ESXi is making the backup from outside.

Steve Hiner

1 Answer

It's not a host hard drive problem. The log file is telling you that it failed to open the virtual hard drive of the VM because of a network error.
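
To make that reading of the log concrete, here is a minimal sketch that pulls the DiskLib error codes and names out of lines shaped like the ones in the question. The regular expression, the function names, and the rule that a name starting with `NBD_ERR_NETWORK` indicates a network problem are my own assumptions for illustration, not anything documented by Altaro or VMware.

```python
import re

# Assumed pattern: matches both "DiskLib error 2338 (NAME)" and
# "DiskLib error 2338: NAME)" as they appear in the log excerpt above.
ERROR_RE = re.compile(r"DiskLib error (\d+)\s*[:(]\s*(\w+)\)")


def disklib_errors(lines):
    """Return a list of (code, name) pairs found in the given log lines."""
    found = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found


def looks_like_network_failure(lines):
    """Heuristic (an assumption): any NBD_ERR_NETWORK_* name points at the network."""
    return any(name.startswith("NBD_ERR_NETWORK") for _, name in disklib_errors(lines))
```

Fed the two `VixDiskLib` lines from the question, this reports error 2338 (`NBD_ERR_NETWORK_CONNECT`) on both, which is a connection failure rather than a disk read failure.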

My guess is that the backups of the VMs that are on the same host as the Altaro backup probably don't fail. Is that right?
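
If you want to test the network theory from the backup VM, a rough sketch like the one below tries a plain TCP connection to the host. The function name `can_connect` is made up for this example. The address and port 443 in the commented usage are the ones shown in your log; port 902 is, as far as I know, what ESXi uses for NBD disk traffic, so treat that as an assumption to verify for your setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example usage (address taken from the log excerpt; ports are assumptions):
# for port in (443, 902):
#     print(port, can_connect("192.168.1.1", port))
```

Because the failures are random, a single successful connection doesn't prove much. Running the check repeatedly during the backup window is more likely to catch an intermittent problem.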

joeqwerty
  • That is correct. The backups that are on a different host than Altaro also do not fail. So maybe this is a network issue I need to solve then. I'll check how my network is set up to see if it could be better. A network problem would make more sense for the random failures (e.g. one night the backup worked for VMs A, B and C and the next night it worked on B, D and E). – Steve Hiner Sep 26 '19 at 15:49
  • After reviewing how my servers are cabled to the network I think you are right. That server was cabled very differently than the ones that work. I'm reworking the network cables a bit and going to reconfigure the virtual networking in vSphere and I think it will resolve multiple issues. I'm sure it will temporarily create issues as I figure out how it should be but in the end I think it will be a lot better. Thanks a bunch. – Steve Hiner Sep 27 '19 at 23:54
  • Glad to help... – joeqwerty Sep 28 '19 at 00:08
  • After moving the cables around and re-configuring IP addresses for the hosts and vCenter, all those backup problems have disappeared, confirming this as the solution. – Steve Hiner Oct 24 '19 at 17:23
  • Glad you got it resolved. – joeqwerty Oct 24 '19 at 18:27