
I have been dealing with this problem for about half a year (I had the luxury of time) and haven't managed to crack it, so I finally gave up and came here to ask for help from people rather than just Google (our VMware support ran out about three years ago and our executives chose not to renew it).

The problem

I have not been dealing with the performance of the virtualisation or the VMs; that all works fine. I really got stabbed in the back when I needed to set up new backup software for the VMs. The hosts, storage arrays and backup servers are all equipped with 10GigEth NICs and are connected to the same 10Gig switch. When I copy a VMDK from a host and its iSCSI-attached storage to the backup server, the speed is a stable 150 Mbit/s. I have about 2-5 TB to back up each night, and at that speed it is not possible. The goal is to increase the copy speed to at least 100 MB/s (5 TB in about 14 hours).
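
To put numbers on the gap between the goal and the observed speed, here is a quick back-of-the-envelope calculation (just unit conversion on the figures above; decimal units, 1 TB = 10^12 bytes, are assumed):

```python
# Back-of-the-envelope check of the backup window (decimal units assumed).
nightly_tb = 5            # worst-case nightly volume from the description above
window_hours = 14         # desired backup window

bytes_total = nightly_tb * 10**12
seconds = window_hours * 3600

required_mb_s = bytes_total / seconds / 10**6    # ~99 MB/s, i.e. the stated 100 MB/s goal
observed_mb_s = 150 / 8                          # 150 Mbit/s per stream is ~18.75 MB/s

print(f"required: {required_mb_s:.0f} MB/s, observed per stream: {observed_mb_s:.1f} MB/s")
print(f"5 TB at the observed speed: {bytes_total / (observed_mb_s * 10**6) / 3600:.0f} hours")
```

So a single stream at the current speed would need roughly three days for a 5 TB night, which is why the per-connection cap, not the hardware, is the problem.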

Topology

  • Network X 192.168.xxx.0/24
  • Network Y 10.0.yyy.0/24
  • Corporate network (we do not manage this, we only use it), which includes various VLANs for physical devices and VMs:
    • Network VLAN A
    • Network VLAN B

Cluster topology

[Cluster topology diagram]

The 10Gig Dell switch is really the heart of the cluster, since everything is connected to it by Cat6 cable. The SW2 switch is daisy-chained to it and serves as the connection point for the redundant link from the ESXi hosts to the X network. There are no VLANs other than 1 (default) configured on any of those switches. Hosts and servers are all connected to VLAN A (or B) so they are accessible from our offices and have access to the internet as well as to the rest of the corporate network. The datastores for the cluster are the Dell (SFP) and HP (copper) storage arrays, all connected by iSCSI to all five hosts. All ESXi hosts and servers also have a copper Cat5 link to SW3 into network Y, where all the BMCs and other management ports are connected as well. One of the backup servers has routing enabled to grant the X network access to the internet through VLAN A. vMotion is enabled on networks X and VLAN A. All 10Gig NICs on devices in network X have jumbo frames enabled and report 10Gb full duplex.
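
Since jumbo frames only help when every hop accepts them end to end, one quick sanity check is a don't-fragment ping at jumbo size from the backup server to each host and storage controller on network X. A minimal sketch, assuming a Linux-style ping and made-up target addresses (the flags differ on Windows; on the ESXi hosts themselves the equivalent test is `vmkping -d -s 8972 <target>`):

```python
import subprocess

# Hypothetical targets on network X; replace with the real host/storage/backup IPs.
targets = ["192.168.0.11", "192.168.0.12", "192.168.0.21"]

# 8972 bytes of ICMP payload + 8 bytes ICMP header + 20 bytes IP header = 9000-byte frame.
# "-M do" sets the don't-fragment bit on Linux, so an undersized hop fails loudly
# instead of silently fragmenting.
for ip in targets:
    result = subprocess.run(
        ["ping", "-M", "do", "-s", "8972", "-c", "3", ip],
        capture_output=True, text=True,
    )
    status = "jumbo OK" if result.returncode == 0 else "jumbo FAILED (check switch/NIC MTU)"
    print(f"{ip}: {status}")
```

If any target fails at 8972 bytes but answers a plain ping, some NIC, vSwitch or switch port in that path is still at MTU 1500.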

The tests

I tested quite a few backup software packages, and since the testing rig had only a 100Base NIC I didn't see a problem with network performance then. But when we bought the software and I discovered that the speed wouldn't go above 150 Mbit/s, I realised I needed to do some tweaking. What I tried follows; each test's result was 150 Mbit/s unless otherwise specified.

  1. This is the desired usage: the backup servers connect over network X to a host and download all backups (in the form of snapshots) to local storage and/or NAS storage.
  2. I created a direct link from one of the host's 10Gig ports to the backup server's 10Gig port and tried SCP, WINSCP, SSH and the backup software to download a VM snapshot from the Dell storage.
  3. I created an NFS datastore on one of the backup servers and migrated a test VM over to it (~500 MB/s, 20 GB, stable), then tried the methods from Test 2 again.
  4. I disconnected host ABC (network VLAN A) from the cluster, reconnected it as XYZ (network X), removed its connection to network VLAN A and its 1Gig connection to X, and tried Test 3 again. Migration was again ~500 MB/s (20 GB, stable).
  5. I fiddled with the virtual switch settings and the bandwidth policy while trying Tests 1, 3 and 4.
  6. I tried running 20 backup jobs simultaneously and each of them ran at 150 Mbit/s. I then kept starting more jobs, and the speed of all of them started dropping at around 30-32 simultaneously running jobs, so there is at least 550 MB/s of aggregate throughput available (see the arithmetic sketch after this list).
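
For what it's worth, the figures from test 6 already suggest a per-connection limit rather than a link, disk or CPU limit. A quick sketch of the arithmetic, using only the numbers above:

```python
# Per-stream vs aggregate throughput from test 6.
per_stream_mbit = 150          # observed speed of a single backup job
jobs_before_drop = 30          # jobs started before individual speeds began to fall

aggregate_mbit = per_stream_mbit * jobs_before_drop     # ~4500 Mbit/s
aggregate_mb_s = aggregate_mbit / 8                     # ~560 MB/s

print(f"aggregate before saturation: ~{aggregate_mbit} Mbit/s (~{aggregate_mb_s:.0f} MB/s)")
# So the 10Gig link, the storage and the backup target can all sustain far more than
# 150 Mbit/s in total; the cap is applied per connection somewhere in the source path.
```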

The infrastructure

  • Five identical Dell PowerEdge R610s (dual Xeon X5660, 200+ GB RAM, 4x GLAN (Broadcom NetXtreme II BCM5709), 1x dual 10GLAN (Intel 82599), no internal storage)
  • Three Dell PowerVault Enclosures (10 TB each, 10k SAS HDDs 600GB each)
  • One HP MSA 2040 (10 TB, three SSD SAS 300GB disks as cache, rest is 10k SAS HDDs)
  • SW1 Dell PowerConnect 8024
  • SW2 Cisco 2960G
  • SW3 Cisco 2950
  • Backup server Dell PowerEdge R530
  • vSphere server: Sun Fire (something old)

I really can't tell where the problem is, but my guess is that it is in ESXi itself. VMs can reach 500 MB/s between each other on different hosts without problems, but the hosts themselves cannot.
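
One way to separate a network-path cap from a datastore-read cap is to push synthetic traffic over the same path with no storage involved. Below is a minimal single-stream TCP throughput probe (a sketch; iperf does the same job if it is available). The idea would be to run the receiver on the backup server and the sender on a VM or other endpoint on network X; the port and addresses are arbitrary placeholders:

```python
import socket, sys, time

CHUNK = 1 << 20          # 1 MiB send/receive buffer
DURATION = 10            # seconds to transmit
PORT = 5001              # arbitrary test port

def receiver():
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    total, start = 0, time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        total += len(data)
    elapsed = time.time() - start
    print(f"received {total/1e6:.0f} MB in {elapsed:.1f}s -> {total/elapsed/1e6:.0f} MB/s")

def sender(host):
    sock = socket.create_connection((host, PORT))
    payload = b"\0" * CHUNK
    total, start = 0, time.time()
    while time.time() - start < DURATION:
        sock.sendall(payload)
        total += len(payload)
    sock.close()
    print(f"sent {total/1e6:.0f} MB")

if __name__ == "__main__":
    # usage: python3 tcp_probe.py recv        (on the backup server)
    #        python3 tcp_probe.py send <ip>   (on a test endpoint on network X)
    if sys.argv[1] == "recv":
        receiver()
    else:
        sender(sys.argv[2])
```

If a single synthetic stream also tops out around 150 Mbit/s, the limit is in the network path or TCP settings; if it runs near line rate, the cap sits in the ESXi storage/management data path.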

I will really appreciate every response to this and will gladly clarify any unclear detail.

Update 1 (final)

We have purchased a Veeam license, configured incremental backups, and scheduled the jobs so they don't overlap much. That setup has virtually eliminated the problem, but the slow per-connection speed remains almost the same. The bottleneck is identified as the source, and we can confidently track the data flow from start to end. We have dug through every network setting on every device or VM that had anything to do with the flow and found nothing. The only thing we can safely state is that the problem lies within the ESXi 5.5 host and its iSCSI-connected datastores.

This problem will remain a mystery, since we are rotating out of this environment and will repurpose it significantly. This question will therefore probably be left without an answer.

LANeo
  • What backup software are you using? – Stuggi Apr 22 '20 at 19:33
  • @Stuggi: It's Iperius Backup. I tried Veeam, Vembu, Nakivo and this one. All of them copied files at the same speed, so I don't think it's a problem in the software. – LANeo Apr 27 '20 at 07:54
  • I'm not familiar with that product, but with Veeam you generally set up a proxy VM that the VM snapshots get mounted to, and this VM then handles the copying instead of the host, as such you get the same network performance as any other VM. Is there any way to do that with Iperius? – Stuggi Apr 27 '20 at 08:24
  • I think I might achieve something similar: I can do a clone or a shadow-copy VM, and I can back up data from inside the VM instead of the VMDK. Or I could leave the copy VM powered off and copy the VMDK with Iperius over SMB; it would depend on what storage I end up using. I have to figure out whether there are any obstacles, but I will definitely try something like this and will post an update. Iperius Backup does not support anything like this out of the box, but it is relatively flexible in scheduling and automation. – LANeo Apr 27 '20 at 09:23

2 Answers


This might not be the advice you're expecting, but it will resolve your problem ^^

The solution is to perform full backups weekly, not daily.

It's one of the first real-world lessons when one starts doing backups (and verifying them :D). Large daily full backups simply don't complete within a day. Long story short, it's not reasonably feasible to back up terabytes per day, because the hosts, network AND storage simply can't keep up with the transfer.

Standard practice is to back up, at most, a daily differential and a weekly full. VMware has built-in ways to handle incremental snapshots, which vary with the edition you pay for. See what you can configure in ESXi.
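
For reference, the built-in mechanism backup products usually key their incrementals off is Changed Block Tracking (CBT). Here is a minimal pyVmomi sketch for bulk-enabling it, in case it helps; the vCenter address and credentials are placeholders, and on a running VM CBT only takes effect after the next snapshot create/delete or power cycle:

```python
# A sketch for bulk-enabling Changed Block Tracking with pyVmomi.
# Connection details are placeholders; run against vCenter, not a single host.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab-style, for self-signed certs
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)

for vm in view.view:
    if vm.config is None or vm.config.changeTrackingEnabled:
        continue                                 # skip edge cases and VMs that already have CBT
    spec = vim.vm.ConfigSpec(changeTrackingEnabled=True)
    print(f"enabling CBT on {vm.name}")
    vm.ReconfigVM_Task(spec)                     # kicks in after the next stun/unstun cycle

Disconnect(si)
```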

VMware will also be smarter about not re-copying the same content over the network; I bet the huge VMDKs hardly change day over day. The bare minimum for large transfers is to use rsync instead of sftp/scp, since rsync only transmits the diff for large files.

user5994461
  • I am planning to use incremental backups, but I need to enable CBT on all the VMs that do not have it yet, and there are quite a lot of them. I am not going to spend that time configuring it until I can solve the speed issue. As you said, full backups should ONLY take place weekly, but at this speed one would take more than 24 hours, and I only have a 24h window for it. I am copying those backup files with rsync to another remote location. I know why I should use rsync, but I used SCP and other tools just to test the speed. Thanks for the answer anyway. – LANeo Apr 27 '20 at 08:14
  • I don't think the goal of doing > 1 Gbps transfers is trivial, because a single TCP stream doesn't do that much even on a LAN, and then the disks and CPU don't necessarily keep up either (depends on the application protocol overhead). That being said, 150 Mbps is quite low, so hmm... – user5994461 Apr 27 '20 at 15:00
  • Try opening system metrics during a transfer: CPU/network/disk on the source and target hosts. That should hint at where the bottleneck is (see the sketch after these comments). You mentioned jumbo frames; have you enabled them everywhere, including routers and switches? If not, there you go. Consider disabling jumbo frames, because they can easily backfire and make everything slower if not configured well end to end. You mention everything is 10 Gbps with Cat5 cables, but Cat5 can't do 10 Gbps; could you double-check the link speed? Try finding the info on the switch, for all ports. – user5994461 Apr 27 '20 at 15:06
  • I must say that even 1 Gbit would cover my needs, but every bit of extra speed would make things much easier. That 150 Mbit/s speed also applies to the 1 Gbit interfaces without jumbo frames, but only from the host to the backup server, since I did not try to mount storage via the slower interfaces; I will test that today or tomorrow. During the tests both servers were doing nothing and no resource showed extensive usage (CPU, disk, etc.). The Cat5 cable is perfectly capable of transmitting at 10 Gbit speeds if it is short (between patch panels, about 2 m), but I use Cat6 in the back. – LANeo Apr 29 '20 at 06:56
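
Following up on the metrics suggestion above, here is a minimal sampling sketch; it assumes the cross-platform psutil package is installed on the source and target servers and is meant to run in a second terminal while a backup job is copying:

```python
# Sample CPU, network and disk counters once a second during a transfer.
# Requires the psutil package (pip install psutil); the interval and duration are arbitrary.
import time
import psutil

prev_net = psutil.net_io_counters()
prev_disk = psutil.disk_io_counters()

for _ in range(60):                      # watch for one minute
    time.sleep(1)
    cpu = psutil.cpu_percent()
    net = psutil.net_io_counters()
    disk = psutil.disk_io_counters()

    net_mb_s = (net.bytes_recv + net.bytes_sent
                - prev_net.bytes_recv - prev_net.bytes_sent) / 1e6
    disk_mb_s = (disk.read_bytes + disk.write_bytes
                 - prev_disk.read_bytes - prev_disk.write_bytes) / 1e6

    print(f"cpu {cpu:5.1f}%  net {net_mb_s:7.1f} MB/s  disk {disk_mb_s:7.1f} MB/s")
    prev_net, prev_disk = net, disk
```

If none of the counters get anywhere near their hardware limits on either end, the cap is being imposed somewhere in the path rather than by the endpoints.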

We use Veeam Backup. It shows us where the bottleneck in our backup infrastructure is and to what degree, broken down into source, network and target: the source is where the data is, the network is self-explanatory, and the target is where we store the backups. I had the same problem and found it was my storage speed; after fixing that, the bottleneck moved to the source, so I added some backup proxies, and after that it was the network, which we solved by changing the MTU. Hope it helps you.

hamid
  • Thanks for the answer. I will look into the MTU settings (also mentioned in other comments) on all devices involved. I think the problem starts somewhere between the storages and the hosts, or in the host as the data passes through it. Migrating a VM disk between storages through vSphere is fast, so I can rule out the disks in the storages as the bottleneck. – LANeo Apr 29 '20 at 07:09