
On Debian under systemd, by default KVM virtual machines under libvirt get assigned to the "machine.slice" slice.

If I then create a cpuset for this slice with cset, assigning some custom set of CPUs, and start a VM, the VM is added to the proper cpuset, i.e.:

user@host ~ $ sudo cset set --list --recurse
cset: 
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-31 y       0 y   610    1 /
 machine.slice 2-15,18-31 n       0 n     0    1 /machine.slice
 machine-qemu\x2d1\x2dweb1.scope 2-15,18-31 n       0 n     0    5 /ma....scope
        vcpu1 2-15,18-31 n       0 n     1    0 /machine.sli...web1.scope/vcpu1
        vcpu2 2-15,18-31 n       0 n     1    0 /machine.sli...web1.scope/vcpu2
        vcpu0 2-15,18-31 n       0 n     1    0 /machine.sli...web1.scope/vcpu0
     emulator 2-15,18-31 n       0 n    82    0 /machine.sli...1.scope/emulator
        vcpu3 2-15,18-31 n       0 n     1    0 /machine.sli...web1.scope/vcpu3

What I'm trying to do is replicate this behaviour with a separate slice and cpuset. However, it doesn't seem to work.

First I create the cset:

user@host ~ $ sudo cset set -c 0-1,16-17 osd.slice
cset: --> created cpuset "osd.slice"

Then I set the service I want to use the slice:

user@host ~ $ diff -u /lib/systemd/system/ceph-osd@.service /etc/systemd/system/ceph-osd@.service
--- /lib/systemd/system/ceph-osd@.service       2021-05-27 06:04:21.000000000 -0400
+++ /etc/systemd/system/ceph-osd@.service       2022-11-08 17:20:32.515087642 -0500
@@ -6,6 +6,7 @@
 Wants=network-online.target local-fs.target time-sync.target remote-fs-pre.target ceph-osd.target
 
 [Service]
+Slice=osd.slice
 LimitNOFILE=1048576
 LimitNPROC=1048576
 EnvironmentFile=-/etc/default/ceph

Then I start one of the services. If I check the service status, I do see that it's in the right slice/cgroup:

user@host ~ $ systemctl status ceph-osd@0.service
● ceph-osd@0.service - Ceph object storage daemon osd.0
     Loaded: loaded (/etc/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-11-08 17:22:32 EST; 1s ago
    Process: 251238 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
   Main PID: 251245 (ceph-osd)
      Tasks: 25
     Memory: 29.5M
        CPU: 611ms
     CGroup: /osd.slice/ceph-osd@0.service
             └─251245 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph

And just for sanity, if I check the VM transient service, it looks basically the same:

$ systemctl status machine-qemu\\x2d1\\x2dweb1.scope 
● machine-qemu\x2d1\x2dweb1.scope - Virtual Machine qemu-1-web1
     Loaded: loaded (/run/systemd/transient/machine-qemu\x2d1\x2dweb1.scope; transient)
  Transient: yes
     Active: active (running) since Tue 2022-11-08 17:03:57 EST; 22min ago
      Tasks: 87 (limit: 16384)
     Memory: 1.7G
        CPU: 4min 33.514s
     CGroup: /machine.slice/machine-qemu\x2d1\x2dweb1.scope
             └─234638 /usr/bin/kvm -name guest=web1,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-web1/master-key.aes -machine pc-i440fx-2.7,accel=kvm,usb=off,dump-guest-core=off,memory-ba>

However, and this is where I'm stuck: if I then check cset again, the tasks are not assigned to the slice cset as I would expect; they are part of the root cset instead, and the slice cset has 0 tasks and 0 subs:

user@host ~ $ sudo cset set --list --recurse
cset: 
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-31 y       0 y   622    2 /
    osd.slice  0-1,16-17 n       0 n     0    0 /osd.slice

I can see nothing obvious about how machine.slice accomplishes this: there is no reference to the cpuset in the actual machine.slice unit file, nor anything in the transient scope units.
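For reference, one way to confirm which cpuset a given task actually landed in is to inspect its /proc entries (using the shell's own PID via /proc/self here for illustration; substitute the ceph-osd PID, e.g. /proc/251245):

```shell
# The "cpuset" line of /proc/<pid>/cgroup shows the task's path in the
# cpuset hierarchy; a bare "/" means it is in the root cset.
grep 'cpuset:' /proc/self/cgroup || true

# On a cgroup v1 system there is also a direct view of the same path:
cat /proc/self/cpuset 2>/dev/null || true
```

Running this against the ceph-osd main PID and its threads (under /proc/<pid>/task/) makes it easy to see whether they ended up in /osd.slice or in the root cset.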

How can I get this new, custom slice/cgroup to emulate what machine.slice is doing, and force anything under it into this cpuset?

As an addendum for the "why"/X-to-my-Y: I've tried things like spawning the ceph-osd process in the cset manually with cset proc --exec, but this doesn't work reliably (sometimes it fails outright with "cannot move"), and even when it does work, the process's threads end up stuck in the root cset even though the main process is moved. So it seems I need a way to make systemd treat the entire unit as part of the cset before the actual process ever starts (unlike cset proc, which spawns the process, forks it, and only then moves it), which appears to be what machine.slice is doing here.

Joshua Boniface

1 Answer


I ended up abandoning cset as the way to do this. The fact that it requires the old v1 cgroup hierarchy and hasn't been significantly updated in years played a major part in that, as did this bug in particular, which prompted me to look more closely into systemd's own options.

I then found systemd's integrated AllowedCPUs directive, which seems to do exactly what I wanted, especially when applied at the slice level.

Going this route, I created drop-in slice units in /etc/systemd/system for each of the subsystems I wanted to isolate (system.slice for the majority of tasks, osd.slice for my OSD processes, and machine.slice for the VMs), each setting AllowedCPUs to the desired CPU list and enabling Delegate for good measure. One reboot later and, as far as I can tell, it's working exactly as intended.
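For illustration, a minimal drop-in for one such slice might look like this (the file name and CPU ranges are examples from my setup; note that AllowedCPUs requires the unified cgroup v2 hierarchy and systemd 244 or newer):

```ini
# /etc/systemd/system/osd.slice.d/cpuset.conf
[Slice]
AllowedCPUs=0-1,16-17
```

After a systemctl daemon-reload (or the reboot), the effective value can be checked with systemctl show -p AllowedCPUs osd.slice, and the resulting cpuset is visible under /sys/fs/cgroup/osd.slice/cpuset.cpus.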

Joshua Boniface