0

I have a Linux server with 48 CPU and, from time to time, some of them starts hitting almost 100% of usage. And when that happens, the usage doesn't go down and that's affecting some services performance. When I reboot this server, everything goes ok for several days when some starts hitting almost 100% of usage again. This is like a cycle. The problem is that the usage of some CUP, when hits 100%, doesn't change, unless I reboot server, and there is a service that's performing badly when that occurs.

When I use the htop command, I can see some process ksoftirq/0, ksoftirq/1, ksoftirq/2 ... is the responsible for this usage. I don't know what to do.

One approach that I tried was: I observed that the number of CPU where that occurs tends to be in the first half (0-23). I tried to change the affinity of those process (ksoftirq) to another CPU in the second half (24-47), but without any success.

What I want is: how to low this ksoftirq process usage or, at least, how to distribute them in order to keep CPU with a usage lower?

I'm not sure if the solution is about changing the affinity or changing some other attribute. I'm really clueless in this situation.

This is a physical server. Some info about it:

output of lsb_release -a:

No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:    11
Codename:   bullseye

output of uname -a:

Linux tiamat-dc 5.10.0-10-amd64 #1 SMP Debian 5.10.84-1 (2021-12-08) x86_64 GNU/Linux

output of lscpu:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   44 bits physical, 48 bits virtual
CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       4
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           46
Model name:                      Intel(R) Xeon(R) CPU           E7530  @ 1.87GHz
Stepping:                        6
CPU MHz:                         1239.565
BogoMIPS:                        3723.93
Virtualization:                  VT-x
L1d cache:                       768 KiB
L1i cache:                       768 KiB
L2 cache:                        6 MiB
L3 cache:                        48 MiB
NUMA node0 CPU(s):               0,4,8,12,16,20,24,28,32,36,40,44
NUMA node1 CPU(s):               1,5,9,13,17,21,25,29,33,37,41,45
NUMA node2 CPU(s):               2,6,10,14,18,22,26,30,34,38,42,46
NUMA node3 CPU(s):               3,7,11,15,19,23,27,31,35,39,43,47
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp l
                                 m constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx
                                 16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt lahf_lm pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida flush_l1d
Paulo
  • 101
  • 2

0 Answers0