My company is running into a network performance problem that seems to have all of the "experts" we're working with (VMware support, RHEL support, our managed services hosting provider) stumped.
The issue is that network latency between our VMs (even VMs residing on the same physical host) increases with network throughput, by a factor of 100 or more. For example, with no network load, latency (measured by ping) is around 0.1 ms. Start transferring a couple of 100 MB files and latency grows to roughly 1 ms. Kick off a bunch of concurrent data transfers (~20 or so) between two VMs and the latency between them can climb to upwards of 10 ms.
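The test itself is nothing exotic; it boils down to roughly the following, where the peer hostname and test file are placeholders and scp stands in for whatever is generating the transfers:

```bash
#!/bin/bash
# Sketch of the reproduction: measure ping RTT to a peer VM with no load,
# then again while ~20 concurrent file transfers saturate the virtual network.
# PEER and TESTFILE are placeholders for this example.

PEER=peer-vm
TESTFILE=/tmp/100mb.bin   # any file of roughly 100 MB

echo "== Baseline latency (no load) =="
ping -c 20 "$PEER"

echo "== Starting 20 concurrent transfers =="
for i in $(seq 1 20); do
    scp -q "$TESTFILE" "$PEER:/tmp/copy_$i.bin" &
done

echo "== Latency under load =="
ping -c 20 "$PEER"

wait   # let the transfers finish
```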
This is a huge problem for us because our application server VMs host processes that might issue a million or so queries per hour against a database server (a different VM). Adding a millisecond or two to each query therefore increases our runtimes substantially, sometimes doubling or tripling our expected durations.
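Back-of-the-envelope: 1,000,000 queries with an extra 2 ms of round-trip latency each works out to roughly 2,000 seconds, i.e. about 33 minutes of added wall-clock time per hour's worth of queries (assuming the queries are issued more or less serially).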
We've got what I would think is a pretty standard environment:
- ESXi 6.0u2
- 4 Dell M620 blades with 2x Xeon E5-2650v2 processors and 128GB RAM
- SolidFire SAN
And our base VM configuration consists of:
- RHEL7, minimal install
- Multiple LUNs configured for mount points at /boot, /, /var/log, /var/log/audit, /home, /tmp, and swap
- All partitions except /boot encrypted with LUKS (over LVM); the layering is sketched just below this list
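For clarity, each encrypted mount point is layered roughly as follows. The volume group and logical volume names here are placeholders for illustration, not our actual build scripts:

```bash
# Illustrative layering for one mount point:
# LVM logical volume -> LUKS container -> filesystem -> mount
lvcreate -L 20G -n lv_var_log vg_system
cryptsetup luksFormat /dev/vg_system/lv_var_log
cryptsetup luksOpen /dev/vg_system/lv_var_log crypt_var_log
mkfs.xfs /dev/mapper/crypt_var_log
mount /dev/mapper/crypt_var_log /var/log
```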
Our database server VMs are running Postgres 9.4.
We've already tried the following:
- Changing the virtual NIC from VMXNET3 to e1000 and back
- Adjusting RHEL's Ethernet stack settings (the kind of tuning involved is sketched after this list)
- Using ESXi's "low latency" option for the VMs
- Upgrading our hosts and vCenter from ESXi 5.5 to 6.0u2
- Creating bare-bones VMs (set up as above with LUKS, etc., but without any of our production services on them) for testing
- Moving the datastore from the SSD SolidFire SAN to local (on-blade) spinning storage
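For what it's worth, the "Ethernet stack settings" item amounted to guest-side tuning of this sort. These are illustrative examples only, not a complete record of what we tried, and driver support for each knob varies between vmxnet3 and e1000:

```bash
# Check which virtual NIC driver is currently in use (vmxnet3 vs. e1000)
ethtool -i eth0

# Examples of the kind of guest-side tuning we experimented with.
# Values are illustrative; each option depends on driver support.
ethtool -G eth0 rx 4096 tx 4096               # larger ring buffers
ethtool -C eth0 rx-usecs 0                    # reduce interrupt coalescing delay
sysctl -w net.core.netdev_max_backlog=30000   # deeper per-CPU receive queue
sysctl -w net.ipv4.tcp_low_latency=1          # favor latency over throughput
```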
None of these improved network latency. The only test that showed the expected (non-deteriorating) latency was the one where we set up a second pair of bare-bones VMs without LUKS encryption. Unfortunately, we need fully encrypted partitions (for which we manage the keys) because we are dealing with regulated, sensitive data.
I don't see how LUKS, in and of itself, can be to blame here. Rather, I suspect that LUKS in combination with ESXi, our hosting hardware, and/or our VM hardware configuration is the culprit.
I performed a test in a much wimpier environment (MacBook Pro, i5, 8GB RAM, VMware Fusion 6.0, CentOS 7 VMs configured similarly with LUKS on LVM and the same testing scripts) and was unable to reproduce the latency issue. Regardless of how much network traffic I pushed between the VMs, latency stayed steady at about 0.4 ms. And this was on a laptop with a ton of other things going on!
Any pointers/tips/solutions will be greatly appreciated!