
Solution inline

We encountered a strange issue and are basically out of ideas by now:

We set up a Galera cluster (3 nodes + MaxScale load balancer) for a customer and he reported slowness. We were unable to identify the issue, so we set up a test scenario to dig deeper:

  • We cloned the complete cluster + the application server into a separate subnet to prevent any interference from/to current users
  • We managed to reproduce the slowness: the operation took ~10s
  • In order to reduce variables, we installed the application on one of the cluster nodes, which allowed us to run tests with the DB connection on localhost

After extensive testing, tweaking and researching we decided to give the same setup a try on VMware ESX. So we migrated the cluster + application to ESX and ran the exact same tests - with weird results...

From there we ran the following tests:

Test                           Result Hyper-V   Result ESX
App -> Load Balancer           10s              6s
App -> Direct DB (localhost)   6.5s             3.6s
App -> Direct DB (other node)  9s               5s
App -> localhost; no cluster   1.5s             1.3s
App (Hyper-V) -> LB (ESX)      13s (cross-hypervisor test)
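
For reference, per-endpoint timings like the ones above can be reproduced with a sketch like the following on the application node; the query, hostnames and credentials are placeholders, not our actual setup, and GNU time (Ubuntu package "time") is assumed:

    # Hypothetical sketch: time the same query against each endpoint.
    QUERY="SELECT 1"   # substitute the slow application operation's query
    for HOST in 127.0.0.1 maxscale-lb node2; do
        /usr/bin/time -f "$HOST: %e s" \
            mysql -h "$HOST" -u app -p"$APP_PW" appdb -e "$QUERY" > /dev/null
    done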

What we tried without any real change in results:

  • moved all cluster nodes onto the same hardware
  • switched MaxScale between round robin and read-write-split
  • applied various MariaDB/Galera settings
  • applied various settings in Hyper-V:
    • VMQ settings
    • SET virtual switch
    • jumbo frames (see the MTU check sketched after this list)
    • added a physical network card instead of the bond
    • activated switch-internal networking instead of using the NIC
    • installed all the latest patches and updated the network card drivers
    • installed the linux-cloud-tools and the linux-azure kernel
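
To verify the jumbo-frame setting end to end from the guests, a quick check like the following helps (node2 is a placeholder hostname; 8972 = 9000 bytes minus 28 bytes of IP/ICMP headers):

    ip link show eth0 | grep -o 'mtu [0-9]*'   # confirm the guest NIC MTU
    ping -M do -s 8972 -c 5 node2              # -M do forbids fragmentation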

The setup:

  • Hyper-V on Windows Server 2019
  • MariaDB on Ubuntu 20.04
  • all-flash storage
  • 16 Gbit Fibre Channel
  • Intel network card
  • load on the host (and on the VMs, actually) was negligible

We are completely stumped because we cannot explain why there is such a huge difference in timings between Hyper-V and ESX. We figure it must be network I/O, but cannot figure out which setting is at fault.
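
One way to check the network-I/O suspicion independently of MariaDB/Galera is a raw throughput/latency test between two cluster nodes, e.g. with iperf3 (node2 is a placeholder hostname):

    # On node2:
    iperf3 -s
    # On node1:
    iperf3 -c node2 -t 30        # sustained throughput
    ping -c 100 node2 | tail -2  # round-trip latency summary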

From the numbers/tests we can conclude which parts are not at fault:

  • HD/IO: the performance drops drastically each time we add a "network" node
  • CPU: the numbers are reproducible, and we ran our tests on a VM without any other load
  • Slow DB queries: the numbers change depending on whether we connect directly to one of the cluster nodes or via localhost, so that can be excluded

Can anyone give us pointers on what else we can try or how to speed up Hyper-V? Or are we messing up some Galera/MaxScale settings?

Edit: We checked for bad segments (netstat -s | grep segments) and found:

                  Hyper-V        ESX
Received          2448010940     2551382424
Sent              5502198473     2576919172
Retransmitted     9054212        7070
Bad segments      83             0
% Retransmitted   0.16%          0.00027%
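
The retransmission ratio above can be computed directly from the same counters, and per-connection retransmits/RTT can be inspected with ss (3306 is the default MariaDB port; adjust for the Galera ports if needed):

    netstat -s | awk '/segments sent out/      {sent=$1}
                      /segments retransmitted/ {re=$1}
                      END {printf "retransmitted: %.4f%%\n", 100*re/sent}'
    ss -ti '( sport = :3306 or dport = :3306 )'   # per-socket rtt, cwnd, retrans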

Solution

Thanks to input from Mircea, we finally got the numbers way down on Hyper-V.

The following configuration changes helped:

  • removed the default Windows bond
  • created a SET (Switch Embedded Teaming) team instead
  • enabled RDMA and jumbo frames on the SET team

With this, the numbers on Hyper-V are basically equivalent to ESX (a rough guest-side verification is sketched below).
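
The SET/RDMA/jumbo-frame configuration itself is done in PowerShell on the Hyper-V host; from inside the Ubuntu guests, a rough verification that the change is effective could look like this (node2 is a placeholder hostname):

    ping -M do -s 8972 -c 5 node2                  # do jumbo frames pass unfragmented?
    netstat -s | grep -i 'segments retransmitted'  # counter should grow far more slowly now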

Niko
1 Answer


For VMs, make sure you install the paravirtualization drivers (Hyper-V guest integration services and VMware Tools). Run synthetic benchmarks for networking. Monitor all equipment on the path (switches, routers, hypervisors, VMs) for CPU, network counters, interrupts, context switches...
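
A quick way to check that the paravirtualized drivers are actually in use inside the Ubuntu guests (Hyper-V synthetic devices on one platform, open-vm-tools on the other) might be:

    lsmod | grep -E 'hv_vmbus|hv_netvsc|hv_storvsc'   # Hyper-V integration modules
    systemctl status open-vm-tools 2>/dev/null        # VMware guest tools, if applicable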

Capture traffic during application benchmarks. Check the frame size, the TCP window size, dropped packets in TCP streams, and the latency of the SYN, SYN/ACK, ACK handshake, and compare it with application-level latency, e.g. an SQL "ping" query: SELECT 1 FROM DUAL; Monitor CPU, network and disk I/O during the application benchmark.
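
A minimal sketch of that comparison, assuming the standard mysql client and tcpdump are available (hostname and credentials are placeholders): capture the MySQL traffic while timing the trivial query in a loop, then compare the wire-level round trips in the capture with the measured latencies.

    tcpdump -i eth0 -w sqlping.pcap port 3306 &   # capture for later analysis
    CAP_PID=$!
    for i in $(seq 1 20); do
        /usr/bin/time -f "%e s" \
            mysql -h node2 -u app -p"$APP_PW" -e 'SELECT 1 FROM DUAL;' > /dev/null
    done
    kill "$CAP_PID"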

Run benchmarks inside the VMs and also on bare metal.

Some other literature: The USE Method (Utilization, Saturation and Errors) and The TSA Method (Thread State Analysis).
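
A rough USE-style sweep (utilization, saturation, errors) on each node can be done with the standard sysstat tools:

    vmstat 1 5         # run queue, context switches, CPU
    iostat -xz 1 5     # disk utilization and latency
    sar -n DEV 1 5     # per-interface throughput
    sar -n EDEV 1 5    # per-interface errors and drops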

Monitoring tools can affect performance, so check their own resource usage (CPU, network and disk I/O). Load-testing utilities use resources too; make sure the workstation doing the load testing is not saturated.

Mircea Vutcovici