
We have a strange issue in our network which, according to networkengineering.stackexchange.com, is off-topic there, even though in my eyes it is a network problem.

We saw it for the first time when we wanted to restore SQL databases to a test DB. The restore failed, and in the Windows event log we saw iSCSI errors; the mounted iSCSI disk seems to lose its connection very often. (The backup is restored with Veeam, which mounts the backup file as an iSCSI volume; the target is the physical backup server, the initiator is the virtual SQL server.)

We did some testing, and it is not only an iSCSI issue; it also happens when we copy files between physical servers and virtual servers. Our monitoring shows high error counts during the copy process, but the strange thing is that we do not see any errors on the switch itself.

What we see on the switch port of the virtual server (the switch is a Netgear M5300) is that "Packets Received > 1518 Octets" and "Packets Transmitted > 1518 Octets" go through the roof when we copy large files, while the same "Packets RX and TX > 1518" counters on the other server's port stay at 0. In every test this happens only on the port of the ESX host, never on the port of the other server.
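For reference, per-port size counters like these usually follow the RMON etherStats buckets, where a frame is binned purely by its length on the wire. Here is a minimal sketch of that classification (the bucket boundaries are the standard RMON ones; the function name is mine, and the Netgear firmware may bin slightly differently):

```python
def rmon_size_bucket(frame_len: int) -> str:
    """Classify a frame by its on-wire length (including FCS) into RMON-style size buckets."""
    buckets = [
        (64, "64 octets"),
        (127, "65-127 octets"),
        (255, "128-255 octets"),
        (511, "256-511 octets"),
        (1023, "512-1023 octets"),
        (1518, "1024-1518 octets"),
    ]
    for upper, name in buckets:
        if frame_len <= upper:
            return name
    return "> 1518 octets"

print(rmon_size_bucket(1518))  # full-size untagged frame -> "1024-1518 octets"
print(rmon_size_bucket(1522))  # the same frame with an 802.1Q tag -> "> 1518 octets"
```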

All ports (switch, vSwitch, port groups, server interfaces) have the MTU set to the default (1518/1500). We rebooted the backup server and the ESX host with all the VMs on it, and we disabled and re-enabled the switch ports. Wireshark on the sending server shows large packets (64 KB), but according to the switch statistics this port only receives normal 1518-byte frames.

It seems to happen only with this one test ESX host, with all the VMs we have on it, even when we upload files to the ESX datastore.

I do not know where to search anymore. The only thing we have not rebooted yet is the switch itself; since it is a core component of the network, we cannot do this during production hours (and production is 24/7). We will try it on the weekend, but if anyone has a tip on where to look, I would appreciate it.

EDIT: for the sake of completeness, a small topology: [topology diagram]

EDIT2: Did some more tests: the errors are only visible on uplink ports with multiple VLANs on them. If I use only a single untagged VLAN, there are no errors and no packets over 1518 anywhere.

Now that I think about it, a packet with a VLAN tag would have a size of 1522. Some switches do not care about this, some do, and the MTU is at its default everywhere. I do not want to stop using tagged VLANs with VMware... Any ideas?
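To make that arithmetic explicit, here is the frame-size math with the standard Ethernet header, FCS, and 802.1Q tag sizes (nothing here is vendor-specific):

```python
MTU = 1500        # default L3 payload size
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
FCS = 4           # frame check sequence
DOT1Q_TAG = 4     # 802.1Q VLAN tag, inserted after the source MAC

untagged = ETH_HEADER + MTU + FCS             # 1518 bytes on the wire
tagged = ETH_HEADER + DOT1Q_TAG + MTU + FCS   # 1522 bytes on the wire

print(untagged, tagged)  # 1518 1522 -> a full-size tagged frame exceeds 1518
```

So a full-size frame on a tagged uplink is 1522 bytes on the wire, which would land in a ">1518" counter even though nothing is misconfigured.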

Tobias
  • Have you checked the jumbo frames configuration of all involved components? – Gerald Schneider Oct 17 '19 at 06:38
  • Yes, all components have the default (1500 or 1518): switch, vSwitch, port group, Windows servers, ESX NIC... The traffic is not routed; I explicitly checked with tracert, so no additional device is involved. – Tobias Oct 17 '19 at 06:42
  • On networkengineering.SE I already answered a few other questions: https://networkengineering.stackexchange.com/questions/63050/dropped-large-packets-with-vmware – Tobias Oct 17 '19 at 06:43
  • Have you checked the [knowledge base article on that topic](https://kb.vmware.com/s/article/2039495)? – Gerald Schneider Oct 17 '19 at 06:46
  • Not yet, I will check this today. But the loss is not at the OS level; even when I copy files to or from the ESX datastore through the vmkernel interface, I get the same result. – Tobias Oct 17 '19 at 06:47

1 Answer


Obviously, the information about "Packets > 1518" has no real meaning here: according to this link to the Netgear forum, the 4 bytes for the VLAN tag are added to the MTU setting automatically, so there is no need to change it to 1522 or anything else.
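In numbers (a small sketch; the automatic +4 bytes behaviour is as described in that Netgear forum thread, not something I verified in the firmware):

```python
configured_max_frame = 1518  # switch default maximum frame size setting
dot1q_allowance = 4          # added automatically for tagged frames, per the forum thread

effective_max_tagged = configured_max_frame + dot1q_allowance
print(effective_max_tagged)  # 1522 -> full-size tagged frames are forwarded fine,
                             # they just show up in the "Packets > 1518 Octets" counter
```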

It would have been better if they did not include tagged packets when counting packets larger than 1518...

This means our backup restore problem has another source... the search continues...

Tobias
  • Did you ever figure this out? I may be facing a similar issue: https://serverfault.com/questions/1056496/why-am-i-not-receiving-a-response-when-the-request-is-sent-through-a-load-balanc/1056661#1056661 The VMs run on VMware and the VLAN (or something upstream) appears to be dropping packets. – Justin York Mar 11 '21 at 21:22
  • Not really. We solved the SQL restore problem by disabling iSCSI timeouts, but why we had problems with iSCSI could never really be cleared up. – Tobias Mar 12 '21 at 13:11