
I am working on a Linux kernel module (a VMM) to test Intel VMX by running a self-made VM (the VM starts in real mode, then switches to 32-bit protected mode with paging enabled).
The VMM is configured NOT to use RDTSC exiting, and to use TSC offsetting.
The VM then runs RDTSC to check the performance, as shown below.

static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
    __asm__ volatile(
            "cpuid"
            :"=a"(*eax),"=b"(*ebx),"=c"(*ecx), "=d"(*edx)
            :"a"(code)
            :"cc");
}

uint64_t rdtsc(void)
{
        uint32_t  lo, hi;
        // RDTSC copies contents of 64-bit TSC into EDX:EAX
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
}

void i386mode_tests(void)
{
    u32 eax, ebx, ecx, edx;
    u32 i = 0;

    asm ("mov %%cr0, %%eax\n"
         "mov %%eax, %0  \n" : "=m" (eax) : :);

    my_printf("Guest CR0 = 0x%x\n", eax);
    cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
    vm_tsc[0]= rdtsc();
    for (i = 0; i < 100; i ++) {
        rdtsc();
    }
    vm_tsc[1]= rdtsc();
    my_printf("Rdtsc takes %d\n", vm_tsc[1] - vm_tsc[0]);
}

The output is something like this,

Guest CR0 = 0x80050033
Rdtsc takes 2742

On the other hand, I made a host application that does the same thing as the guest code above:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
        __asm__ volatile(
                        "cpuid"
                        :"=a"(*eax),"=b"(*ebx),"=c"(*ecx), "=d"(*edx)
                        :"a"(code)
                        :"cc");
}

uint64_t rdtsc(void)
{
        uint32_t  lo, hi;
        // RDTSC copies contents of 64-bit TSC into EDX:EAX
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
}

int main(int argc, char **argv)
{
        uint64_t     vm_tsc[2];
        uint32_t eax, ebx, ecx, edx, i;

        cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
        vm_tsc[0]= rdtsc();
        for (i = 0; i < 100; i ++) {
                rdtsc();
        }
        vm_tsc[1]= rdtsc();
        printf("Rdtsc takes %ld\n", vm_tsc[1] - vm_tsc[0]);

        return 0;
}

It outputs the following:

Rdtsc takes 2325

Running the above two tests 40 times each and averaging the results gives:

avg(VM)   = 3188.000000
avg(host) = 2331.000000
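
The 40-run harness is not shown above; a minimal host-side sketch of how such an average can be collected (the exact loop structure here is my assumption) looks like this:

#include <stdint.h>
#include <stdio.h>

static uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
}

int main(void)
{
        enum { RUNS = 40, INNER = 100 };
        uint64_t total = 0;
        uint32_t r, i;

        /* Repeat the 100-RDTSC measurement RUNS times and average the deltas. */
        for (r = 0; r < RUNS; r++) {
                uint64_t t0 = rdtsc();
                for (i = 0; i < INNER; i++)
                        rdtsc();
                total += rdtsc() - t0;
        }
        printf("avg = %f\n", (double)total / RUNS);
        return 0;
}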

The performance difference between running the code in the VM and on the host is too large to ignore, which is not what I expected.
My understanding is that with TSC offsetting enabled and RDTSC exiting disabled, the guest executes RDTSC natively (the hardware just adds the VMCS TSC_OFFSET field to the host TSC), so there should be little difference between running it in the VM and on the host.
Here are the relevant VMCS fields:

 0xA501E97E = control_VMX_cpu_based  
 0xFFFFFFFFFFFFFFF0 = control_CR0_mask  
 0x0000000080050033 = control_CR0_shadow  
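
For reference, here is a minimal sketch (not my actual VMM code, and the macro names are my own) that decodes the two TSC-related bits from the control_VMX_cpu_based value above; the bit positions are from the Intel SDM's primary processor-based VM-execution controls (bit 3 = "use TSC offsetting", bit 12 = "RDTSC exiting"):

#include <stdint.h>
#include <stdio.h>

/* Primary processor-based VM-execution controls (Intel SDM):
 * bit 3 = "use TSC offsetting", bit 12 = "RDTSC exiting". */
#define CPU_BASED_USE_TSC_OFFSETTING (1u << 3)
#define CPU_BASED_RDTSC_EXITING      (1u << 12)

int main(void)
{
        uint32_t ctl = 0xA501E97E;  /* control_VMX_cpu_based from the VMCS dump above */

        printf("use TSC offsetting: %d\n", !!(ctl & CPU_BASED_USE_TSC_OFFSETTING));
        printf("RDTSC exiting     : %d\n", !!(ctl & CPU_BASED_RDTSC_EXITING));
        return 0;
}

With the value above this prints 1 and 0 respectively, i.e. TSC offsetting is on and RDTSC exiting is off, matching the configuration described at the top.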

In the last-level EPT PTEs, bits[5:3] = 6 (write-back memory type) and bit[6] = 1 (ignore PAT); EPTP[2:0] = 6 (write-back).
I tested both on a physical Linux host and inside a VMware VM acting as the host, and got similar results in both cases.
I am wondering if there is anything I missed here.
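
For completeness, here is how the EPT fields described above can be assembled, as a minimal illustrative sketch (not my VMM code; the helper names and the example addresses are hypothetical, and I assume a 4-level EPT page walk). The layouts are from the Intel SDM: a last-level EPT PTE has RWX permissions in bits 2:0, the memory type in bits 5:3 and "ignore PAT" in bit 6; the EPTP has the memory type in bits 2:0 and (page-walk length - 1) in bits 5:3.

#include <stdint.h>
#include <stdio.h>

#define EPT_MT_WB 6ull  /* write-back memory type */

/* Last-level (4 KB) EPT PTE: RWX, memory type = WB, ignore PAT,
 * host-physical page frame in the upper bits. */
static inline uint64_t ept_pte_wb(uint64_t host_phys_4k)
{
        return (host_phys_4k & ~0xfffull)
             | (1ull << 0)           /* read */
             | (1ull << 1)           /* write */
             | (1ull << 2)           /* execute */
             | (EPT_MT_WB << 3)      /* bits[5:3] = 6 (WB) */
             | (1ull << 6);          /* bit[6] = 1 (ignore PAT) */
}

/* EPTP: memory type in bits 2:0, (page-walk length - 1) in bits 5:3,
 * PML4 table physical address in the upper bits. */
static inline uint64_t make_eptp(uint64_t pml4_phys)
{
        return (pml4_phys & ~0xfffull)
             | EPT_MT_WB             /* EPTP[2:0] = 6 (WB) */
             | (3ull << 3);          /* 4-level page walk */
}

int main(void)
{
        /* Hypothetical addresses, just to show the resulting encodings. */
        printf("PTE  = 0x%llx\n", (unsigned long long)ept_pte_wb(0x200000));
        printf("EPTP = 0x%llx\n", (unsigned long long)make_eptp(0x1000));
        return 0;
}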

  • That's a *very* short test; have you checked the asm to make sure it won't include any page-fault costs which might differ? Have you tried making it *much* longer (like 100M iterations) and timing it with wall-clock time (plenty of time for CPU to turbo to max frequency and hide any startup-overhead effects )? Or with performance counters for core clock cycles instead of reference cycles, regardless of CPU frequency variation. (`perf stat ./a.out`) – Peter Cordes Jun 07 '18 at 03:57
  • What is the basis of your belief that there should be little difference in performance? Do you have a reference? – prl Jun 07 '18 at 05:59
  • you say you tested with VMware and got “similar” results—can you be more specific? What did you run and what were the results similar to? – prl Jun 07 '18 at 06:01
  • @prl: Intuitively, I would have expected `rdtsc` to either vmexit or decode + run the same way on bare metal vs. guest. Seems like a reasonable hypothesis to start from, but yeah it's possible that it's false. – Peter Cordes Jun 07 '18 at 08:43
  • @prl, I run the 2 tests on both Phys Linux host and VMware VM (as Virt. host with the same Linux kernel). In both cases, my VMM+VM test takes more TSC ticks than running in host. Then in the Phys host, I start a KVM VM (running Linux), and run the 2nd code in it, I found running in KVM takes almost the same time as running in the same Phys host. I expect to get this result with my VMM+VM. For the page fault (EPT) overhead, my VM code is loaded into a 2MB host page by my VMM, which is big enough for current VM, so I don't think there is PF, Thx. – wangt13 Jun 07 '18 at 11:09
  • @Peter, I changed the 100 to 1M, and re-ran the 2 tests, my VM took about 5-10% more than the native run. I would check if perf stat can help in my VM case. – wangt13 Jun 07 '18 at 11:35
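
Following up on the suggestions in the comments, a much longer wall-clock-timed version of the host test might look like the sketch below (illustrative only; 100M iterations timed with clock_gettime, which can also be run under "perf stat ./a.out" to get core clock cycles):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
}

int main(void)
{
        const uint64_t iters = 100000000ull;  /* 100M: long enough to hide startup and turbo effects */
        struct timespec t0, t1;
        uint64_t i, sink = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++)
                sink += rdtsc();              /* accumulate so the result is used */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f ns per rdtsc (sink=%llu)\n", secs * 1e9 / iters, (unsigned long long)sink);
        return 0;
}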
