2

I have several servers (different ages) running on Ubuntu 22.04 with QEMU and libvirt.
When migrating VMs between some of these servers, the VM immediately freezes on the target system.
Tested guests were HVM machines running current Ubuntu or CentOS/Rocky Linux.

Previously the servers ran on Ubuntu 18.04 and were reinstalled recently. On 18.04 there were no problems with live-migration - this only started to happen on 22.04.

Details about the hardware:
I have multiple servers, but I just pick four of them to better compare:
1: AMD EPYC 7513 @ 1.50 GHz 2: Intel Xeon Silver 4114 CPU @ 2.20GHz
3: Intel Xeon CPU E5-2630 v4 @ 2.20GHz
4: Intel Xeon CPU E5-2620 v4 @ 2.10GHz

Migration works between 1<->2 and 3<->4. Migration does not work, when migrating from 1->3 or 1->4, but still works in the opposite direction.

Details about configuration:
Virtual machines are configured to use a "SandyBridge" CPU model, as how it worked before with the older Ubuntu version.
Also, my test guest has no hard drive, no audio device, no networking. Just a Live CD device via a virtual IDE controller.

What I debugged so far:
This seems to be a timing or interrupt related problem. The machine freezes and doesn't show any reaction to keyboard input (via VNC).
Also, the following is logged on the target host:
qemu-system-x86_64: warning: TSC frequency mismatch between VM (2600083 kHz) and host (3207181 kHz), and TSC scaling unavailable
I stumbled across this report and tried to set a fixed TSC frequency for the guest (with different values), but no success so far.
I also tried to disable TSC completely for the guest, making it relying on HPET, which makes the VM really laggy and throws warnings about stuck IRQ events on the console, sometimes even a kernel panic caused by the NMI watchdog. Also, after migration, the VM completely froze.

What I discovered:
If the guest is "running" the GRUB bootloader or a Live CD menu (so, it has not booted yet), the VM is migratable without problems. Booting the guest into recovery mode still leads to a frozen VM.
I tried noapic acpi=off notsc boot options inside the guest, but without change.

What could be causing this?
At the moment, I'm about 50% convinced, this could be related to the use of TSC by QEMU, leading to a freezing system, if TSC clocks between servers deviate too much. But it seems, I can't simply forbid the use of TSC to QEMU, even if I boot the host with the notsc kernel parameter.
But, of course, it could be related to something totally different.
Is there a way to force QEMU to use HPET or any other clock source? Or is there anything else I could try to make migrations working again?

Thanks!

Max
  • 31
  • 2
  • if i remember correctly, the issue might be the difference between the host cpu. they must be identical, this is what I remember what Proxmox says about the same – djdomi Sep 09 '22 at 18:07

2 Answers2

1

I experienced the same problem after migrating my KVM hosts to Ubuntu 22.04.

Solution

The Solution in my case was an upgrade to the hwe-22.04 kernel (at the time of writing: 5.19.0-32-generic).

Since Ubuntu Server installations don't get the hwe kernel automatically, I used the following command to install it manually:

apt install --install-recommends linux-generic-hwe-22.04

Hope this works for you too. Good luck!

Problem Description

When starting the migration of a guest to a target host, guests already running on that host may experience a leap of their system clock causing the guest kernel to freeze. The migrated guest usually experiences that leap of the system clock shortly after the migration completed.

Probable Cause

The reason for the clock leap seems to be a regression that causes side effects in the tsc scaling when multiple guests share the same cpu core.

This patch fixes the tsc behaviour and is included in the Ubuntu hwe-22.04 kernel but not in the original Ubuntu 22.04 kernel.

References

Andreas
  • 11
  • 3
0

i have exactly the same problem !

since i re-installed my hosts in 2204, live migrating from a Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz to a Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz freeze the migrated guest.

the opposite direction works fine

it appears to be a ubuntu2204 kernel problem. if i downgrade the server kernel to 5.4.0-139-generic ( the last kernel from ubuntu2004 ) without changing anything else the problem does not happen any more , you can live migrate in both direction without issue.

i have also done some test with ubuntu mainline kernels

5.15.94-051594-generic is OK
5.16.0-051600-generic has the problem, so i guess the issue appears with 5.16 kernels, and was backported by ubuntu team to the 5.15 in ubuntu2204.

hope this helps