
We have a Heartbeat/DRBD/Pacemaker/KVM/Qemu/libvirt cluster consisting of two nodes. Each node runs Ubuntu 12.04 64-bit with the following packages/versions:

  • Kernel 3.2.0-32-generic #51-Ubuntu SMP
  • DRBD 8.3.11
  • qemu-kvm 1.0+noroms-0ubuntu14.3
  • libvirt 0.9.13
  • pacemaker 1.1.7
  • heartbeat 3.0.5

The virtual guests are running Ubuntu 10.04 64-bit and Ubuntu 12.04 64-bit. We use a libvirt feature to pass the capabilities of the host CPUs to the virtual guests in order to achieve the best CPU performance.
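
For reference, whether a guest actually received the host CPU model can be confirmed from its domain XML (a sketch, assuming a libvirt domain named `monitoring` as described below):

```
# Prints the <cpu> element of the guest definition; with host CPU passthrough
# this typically shows mode='host-passthrough' or mode='host-model'
virsh dumpxml monitoring | grep -A 3 '<cpu'
```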

Now here is a common setup on this cluster:

  • VM "monitoring" has 4 vCPUs
  • VM "monitoring" uses ide as disk interface (we are currently switchting to VirtIO for obvious reasons)

We recently ran some simple tests. I know they are not professional and do not reach high standards, but they already show a strong trend:

Node A is running VM "bla"; Node B is running VM "monitoring".

When we rsync a file from VM "bla" to VM "monitoring" we achieve only 12 MB/s. When we perform a simple dd if=/dev/zero of=/tmp/blubb inside the VM "monitoring" we achieve around 30 MB/s.

Then we added 4 more vCPUs to the VM "monitoring" and restarted it. The VM "monitoring" now has 8 vCPUs. We re-ran the tests with the following results: When we rsync a file from VM "bla" to VM "monitoring" we now achieve 36 MB/s. When we perform a simple dd if=/dev/zero of=/tmp/blubb inside the VM "monitoring" we now achieve around 61 MB/s.
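
For what it is worth, a slightly more controlled variant of the dd test forces the data to actually reach the disk instead of stopping at the page cache (file path and size are only illustrative):

```
# Write 1 GiB and include the final flush to disk in the measured rate
dd if=/dev/zero of=/tmp/blubb bs=1M count=1024 conv=fdatasync
```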

For me, this effect is quite surprising. How come adding more virtual CPUs to this virtual guest apparently also means more disk performance inside the VM?

I don't have an explanation for this and would really appreciate your input. I want to understand what causes this performance increase, since I can reproduce this behaviour 100% of the time.

PythonLearner
  • Use a purpose-built benchmarking tool like [iozone](http://www.iozone.org/) or [bonnie++](http://www.coker.com.au/bonnie++/) to help eliminate other variables. – ewwhite Dec 13 '12 at 17:28
  • It would be interesting how the actual CPU loads look ... is something cpu bound introduced in a hidden place (rsync plus probably ssh certainly is to an extent, so are the network drivers introduced that way, also dd might do unexpected cpu bound things...), or is it actually things suboptimally *waiting* for each other due to less execution threads available? – rackandboneman Dec 14 '12 at 01:46
  • run `kvm_trace` to see how the number of `IO_Exits` changes when you change the CPU numbers. I would guess it's because you are using IDE, which gets scheduled with the guest CPUs. With virtio the performance should be consistent, and when data-plane is in qemu, it will get a drastic boost. Another guess can be at the fact that you are using a distribution that is known for a buggy virtualization stack. – dyasny Dec 14 '12 at 05:36
  • @ewwhite: Yes, running professional tests would be a good choice. However, I want to understand first why this I/O behaviour occurs. @rackandboneman: When I looked last, the 4 CPUs had a very high wait value (around 70-80%). @dyasny: Thanks, I will try that. How can I check whether data-plane is activated/currently used? – PythonLearner Dec 14 '12 at 08:07
  • data-plane is experimental for now, and I am pretty certain the first distribution to pick it up will be Fedora. http://pl.digipedia.org/usenet/thread/11769/28329/ – dyasny Dec 14 '12 at 15:18
  • @dyasny Can you point me to a website which describes the link between IDE driver and CPU scheduling? I wasn't able to find something through googling. – PythonLearner Dec 18 '12 at 08:19
  • like I said, that's more of a guess, that can be verified by reading the actual code. AFAIK, the threads that process IO are spawned by the kvm vcpu threads, which is the exact issue data-plane come in to solve – dyasny Dec 18 '12 at 09:09
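
As a side note on the `kvm_trace` suggestion: it was removed from recent kernels in favour of KVM tracepoints, so on this stack the exit counts can be gathered with `perf` instead (a rough sketch, run on the host, assuming `perf` and the KVM tracepoints are available):

```
# Count VM exits system-wide for 10 seconds while the guest test is running;
# a large difference between the 4-vCPU and 8-vCPU runs would support the
# IDE-emulation theory above
perf stat -e 'kvm:kvm_exit' -a sleep 10
```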

1 Answer


I will give a very rough idea/explanation.

In the OP's situation, besides measuring within the VM, the host should be looked at too.

In this case, we can assume the following are correct:

  1. In all the tests, the host I/O (disk) bandwidth is not maxed out, since the I/O of VM "monitoring" increases as more CPUs are allocated to it. If host I/O were already maxed out, there would be no I/O performance gain (one way to check this is sketched right after this list).
  2. "bla" is not the limiting factor, as "monitoring" I/O performance improved without any changes to "bla".
  3. CPU is the main factor for the performance gain (in the OP's case), since I/O is not the bottleneck and the OP did not mention any memory size changes. But why? Or how?
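
Assumption 1 can be verified on the host during the tests, for example with iostat from the sysstat package (a sketch; the backing device name will differ per setup):

```
# Run on the host while the guest benchmark is running; a %util column
# close to 100 on the device backing the VM would mean host I/O is saturated
iostat -dx 1
```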

Additional factor

  1. Writes take more time than reads. This is the same for the VM and for the host. Put in extremely simple terms: the VM waits for the host to finish reads and writes.

What happens when more CPUs are assigned to "monitoring"?

When "monitoring" is allocated more CPUs, it gains more processing power, but it also gains more processing time for I/O.

This has nothing to do with rsync, as it is a single-threaded program.

It is the I/O layer utilizing the increased CPU power, or more precisely, the increased processing time.

If a CPU monitoring program (e.g. top) is used on "monitoring" during the test, it will show not one but all CPU usages go up, and also %wa. %wa is the time spent waiting on I/O.
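
To watch this per CPU rather than as one aggregate figure, something like mpstat (also from sysstat) can be run inside "monitoring" (a sketch):

```
# Per-CPU utilization inside the guest, refreshed every second;
# the %iowait column is the per-CPU equivalent of top's %wa
mpstat -P ALL 1
```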

This performance increase will only happen when your host I/O is not maxed out.

I cannot find documentation of the CPU scheduling on the KVM site, but there is a blog mentioning that KVM uses CFS and cgroups; the following is the quote:

Within KVM, each vcpu is mapped to a Linux process which in turn utilises hardware assistance to create the necessary 'smoke and mirrors' for virtualisation. As such, a vcpu is just another process to the CFS and also importantly to cgroups which, as a resource manager, allows Linux to manage allocation of resources - typically proportionally in order to set constraint allocations. cgroups also apply to Memory, network and I/O. Groups of processes can be made part of a scheduling group to apply resource allocation requirements to hierarchical groups of processes.
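
That mapping is easy to see on the host: each vCPU of the guest is just another thread of its qemu-kvm process (a sketch, assuming the guest is named `monitoring`):

```
# vCPU-to-thread mapping as reported by libvirt
virsh vcpuinfo monitoring

# The same vCPU threads as ordinary Linux tasks scheduled by the CFS
ps -Lo pid,lwp,psr,pcpu,comm -p "$(pgrep -of 'qemu.*monitoring')"
```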

In a nutshell, more CPUs = more CPU time = more I/O time slots in a given period of time.

John Siu
  • Thank you for writing this answer. "More vCPUs means more processing time for I/O" is the explanation I was looking for. Worth the bounty! – PythonLearner Dec 25 '12 at 13:51