
What did I do?
I ran `qemu-x86_64 -singlestep -d nochain,cpu ./dummy` to dump all the registers of a dummy program after each instruction and used grep to save all the RIP values into a text file (qemu_rip_dump.txt). I then single-stepped the dummy program with ptrace and dumped the RIP values after each instruction into another text file (ptrace_rip_dump.txt). Finally, I compared both .txt files with diff.
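
Roughly, the ptrace side of this looks like the following minimal sketch (assuming x86-64 Linux; error handling omitted, and `./dummy` stands for the traced binary):

```c
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        /* Child: ask to be traced, then exec the target binary. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execl("./dummy", "dummy", NULL);
        return 1; /* only reached if execl fails */
    }

    int status;
    waitpid(child, &status, 0); /* child stops at the initial exec */
    while (WIFSTOPPED(status)) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        printf("%llx\n", regs.rip); /* one RIP value per instruction */
        ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
        waitpid(child, &status, 0);
    }
    return 0;
}
```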

What result did I expect?
I expected both runs of the dummy program to execute the same instructions, and thus both dump files to be identical (the same RIP values and the same number of them).

What result did I actually get?
Ptrace dumped about 33,500 RIP values and QEMU dumped about 29,800. The RIP values in the two files start to differ around the 240th instruction. Most of the values are identical, but ptrace executes about 5,500 instructions that QEMU does not, and QEMU executes about 1,800 instructions that ptrace does not, for a net difference of about 3,700 instructions. The two runs diverge throughout the whole program; for example, there is a block of about 3,500 instructions around instructions 26,500-30,000 (cleanup?) that the native run executes but QEMU does not.

What is my question?
Why are the RIP values not the same throughout the whole execution of the program, and most importantly: what do I have to do to make both runs identical?

Extra Info

  • the dummy program was a main function that returns 0, but this problem exists in every executable I have traced
  • I have tried forcing QEMU to use the ld-linux-x86-64.so.2 linker with -L /lib64/ - this had no effect
  • if I run QEMU multiple times the dumps are identical (same number and values of RIP); the same goes for ptrace
Sbardila
  • What happens when you run the same program natively on two different systems? – stark Dec 10 '20 at 14:27
  • @stark running the code on a different system changes the number of instructions executed slightly, but the difference between ptrace and qemu stays about the same – Sbardila Dec 10 '20 at 14:31
  • You would need to analyze the actual run of execution (if they diverge by insn 240 or so this will not be very difficult) to identify why. Possible causes include that the environment QEMU provides the program will not be exactly identical to the native version -- for instance the set of things it puts in the auxiliary vector are a bit different, so if the dynamic linker iterates through the auxv then it will go round a loop a different number of times. – Peter Maydell Dec 10 '20 at 21:32
  • Incidentally, unless you really care about the dynamic linker you could probably just discard all the RIP values before the first insn in main() -- I suspect that would be more likely to give identical results in both cases, though there are certainly guest programs that would show a difference after main() as well. – Peter Maydell Dec 10 '20 at 21:33
  • @PeterMaydell I have used QEMU's in_asm logging to find out where the differences occur. I found that the first difference happens at `_dl_aux_init`. Other differences happen at `__tunables_init`, `get_common_indices.constprop.0`, `__libc_start_main`, `strchr_ifunc`, `tcache_init.part.0`, `_dl_non_dynamic_init`, `__strlen_sse2`, `__mempcpy_sse2_unaligned` and `__strrchr_sse2` – Sbardila Dec 11 '20 at 15:34
  • OK, so the first part of that is indeed where the dynamic linker is looking through the aux vector. Some of the others look like they are where the guest code is looking at what features the CPU supports -- on your host CPU there is SSE2 support, so the guest libc picks optimised versions of functions like strlen and memcpy that use it, but QEMU doesn't support SSE2 emulation, so the guest libc uses different versions. – Peter Maydell Dec 12 '20 at 18:18

1 Answer


With a "does nothing" program like the one you're testing, most of the execution run will be in the guest dynamic linker and libc. Those do a lot of work behind the scenes before your program gets control, and some of that work varies between a "native" run and a "QEMU" run. There are two main sources of divergence, judging by some of the extra detail you give in the comments:

  1. The environment QEMU provides to the guest binary is not 100% identical to that which a real host kernel provides; it's only intended to be "close enough that correct guest binaries behave in a reasonable way". For instance, there is a data structure passed to the guest called the "ELF auxiliary vector"; this contains information including "what CPU features are supported", "what user ID are you executing as", and so on. The dynamic linker iterates through this data structure at startup, so minor harmless differences in which entries are in the vector and in what order will cause slightly different execution paths in the guest code (see the first sketch after this list).

  2. The CPU QEMU emulates does not provide exactly the same features that your host CPU does. There's no support for emulation of AVX or SSE2, for instance. The guest libc adjusts its behaviour to take advantage of the CPU features that are available, so it picks different optimised versions of functions like memcpy() or strlen() under the hood. Since the dynamic linker ends up calling these functions, this also results in divergence of execution (see the second sketch after this list).
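
To make point 1 concrete, here is a minimal sketch (assuming x86-64 Linux) of walking the auxiliary vector the way the dynamic linker does at startup; it reads /proc/self/auxv, which exposes the same data to an already-running process. The loop runs once per entry, so a different set of entries means a different number of iterations:

```c
#include <elf.h>
#include <stdio.h>

int main(void)
{
    /* /proc/self/auxv holds the same ELF auxiliary vector that the
     * kernel (or QEMU's user-mode loader) passed at process startup. */
    FILE *f = fopen("/proc/self/auxv", "rb");
    if (!f) {
        return 1;
    }
    Elf64_auxv_t entry;
    /* One iteration per entry: if QEMU supplies a different set of
     * entries than the host kernel, this loop runs a different number
     * of times. */
    while (fread(&entry, sizeof(entry), 1, f) == 1 && entry.a_type != AT_NULL) {
        printf("type=%2lu val=0x%lx\n",
               (unsigned long)entry.a_type,
               (unsigned long)entry.a_un.a_val);
    }
    fclose(f);
    return 0;
}
```

And for point 2, an illustrative sketch of feature-dependent function selection. This is not glibc's actual IFUNC machinery; it just uses GCC's __builtin_cpu_supports to show how the same binary takes different paths depending on what the (real or emulated) CPU reports:

```c
#include <stdio.h>

static void copy_sse2(void)    { puts("SSE2-optimised copy"); }
static void copy_generic(void) { puts("generic copy"); }

int main(void)
{
    __builtin_cpu_init(); /* initialise GCC's CPU feature detection */
    /* Under an emulated CPU that reports a different feature set than
     * the host, a different branch executes here. */
    if (__builtin_cpu_supports("sse2")) {
        copy_sse2();
    } else {
        copy_generic();
    }
    return 0;
}
```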

You may be able to work around some of this by restricting the area of instruction tracing you look at to just starting from the beginning of the 'main' function to avoid tracing all of the dynamic linker startup. I can't think of a way to work around the differences in what CPU features are available on the host vs QEMU, though.
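
As a sketch of that trimming step: find main's address (e.g. with `nm ./dummy`, for a non-PIE binary) and discard every RIP value in a dump file before its first occurrence. The address used below is hypothetical:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical address of main in a non-PIE ./dummy; substitute the
 * value reported by `nm ./dummy`. */
#define MAIN_ADDR 0x401106UL

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <rip_dump.txt>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char line[64];
    int seen_main = 0;
    /* Skip everything (dynamic linker, libc startup) until the first
     * RIP value equal to main's address, then echo the rest. */
    while (fgets(line, sizeof(line), f)) {
        if (!seen_main && strtoul(line, NULL, 16) == MAIN_ADDR) {
            seen_main = 1;
        }
        if (seen_main) {
            fputs(line, stdout);
        }
    }
    fclose(f);
    return 0;
}
```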

Peter Maydell
  • Thank you very much for your explanation! Do you have suggestions on keywords I could look for to learn more about libc behaving differently? – Sbardila Dec 13 '20 at 18:53