
Currently I have a problem with some of my Debian 10 servers. The SLAB usage of several servers is extremely high (mostly 50% of the memory is used by slab). I can't figure out what the problem is.

Maybe someone of you has an idea?

atop

ATOP - mail                                        2019/10/14  13:00:03                                        --------------                                        10m0s elapsed
PRC | sys   31.00s  | user  30.65s |               |  #proc    277 | #trun      1  | #tslpi   728 |  #tslpu     0 | #zombie    0  | clones  1515  |              |  #exit   1396 |
CPU | sys       5%  | user      5% |  irq       0% |  idle    189% | wait      0%  | steal     0% |  guest     0% | ipc     0.95  | cycl   81MHz  | curf 2.10GHz |  curscal   ?% |
cpu | sys       3%  | user      3% |  irq       0% |  idle     94% | cpu000 w  0%  | steal     0% |  guest     0% | ipc     0.96  | cycl   83MHz  | curf 2.10GHz |  curscal   ?% |
cpu | sys       3%  | user      2% |  irq       0% |  idle     95% | cpu001 w  0%  | steal     0% |  guest     0% | ipc     0.93  | cycl   80MHz  | curf 2.10GHz |  curscal   ?% |
CPL | avg1    0.40  | avg5    0.17 |               |  avg15   0.11 |               | csw   556801 |               | intr  272694  |               |              |  numcpu     2 |
MEM | tot    13.7G  | free  247.9M |  cache 619.9M |  dirty   0.1M | buff   43.6M  | slab    7.2G |  shmem 116.1M | shrss   0.0M  | vmbal   0.0M  | hptot   0.0M |  hpuse   0.0M |
SWP | tot     1.9G  | free    0.0M |               |               |               |              |               |               |               | vmcom  13.4G |  vmlim   8.7G |
PAG | scan   17448  | steal  15442 |  stall      0 |               |               |              |               |               |               | swin      12 |  swout   3664 |

slabtop

root@mail ~ # slabtop --sort c -o
 Active / Total Objects (% used)    : 29313064 / 29993742 (97,7%)
 Active / Total Slabs (% used)      : 973029 / 973029 (100,0%)
 Active / Total Caches (% used)     : 100 / 125 (80,0%)
 Active / Total Size (% used)       : 7104170,14K / 7271337,84K (97,7%)
 Minimum / Average / Maximum Object : 0,01K / 0,24K / 8,00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
3812863 3799719  99%    0,20K 200677       19    802708K vm_area_struct
198814 198707  99%    3,69K  24857        8    795424K task_struct
2990368 2499431  83%    0,25K 186898       16    747592K filp
270498 270443  99%    2,00K  16907       16    541024K kmalloc-2048
124487 124471  99%    4,00K  15574        8    498368K kmalloc-4096
218567 218515  99%    2,06K  14573       15    466336K sighand_cache
555948 554329  99%    0,66K  46329       12    370632K proc_inode_cache
242424 242169  99%    1,00K  15153       16    242448K kmalloc-1024
7592960 7592441  99%    0,03K  59320      128    237280K kmalloc-32
221220 221172  99%    1,06K  14752       15    236032K signal_cache
218679 218646  99%    1,06K  14581       15    233296K mm_struct
334913 334913 100%    0,69K  14563       23    233008K files_cache
317829 317829 100%    0,69K  13821       23    221136K sock_inode_cache
3406080 3403192  99%    0,06K  53220       64    212880K anon_vma_chain
208568 208542  99%    1,00K  13037       16    208592K UNIX
971355 906067  93%    0,19K  46255       21    185020K dentry

Usually more than 50% of the RAM is used by slab. After a week the OOM killer kicks in and frees up some more memory (on this system; on other systems it happens after about 20 hours of uptime).

I also took a look into open network connections / files / deleted files, but these values seem pretty normal to me.

Thanks in advance, Alex


1 Answer


This host does not have enough memory for the allocations it is doing. Increase memory to stop it from crashing while you tune the applications and do a capacity assessment.

Based on free + cache in the atop output, memory is about 96% utilized. The Linux virtual memory system definitely considers this memory pressure, so it is not surprising that it is paging out.
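As a rough sketch of that calculation, you can compute how much RAM is neither free nor easily reclaimable page cache directly from /proc/meminfo (this is an approximation; it ignores reclaimable slab and other nuances):

```shell
#!/bin/sh
# Rough memory-pressure estimate: RAM that is neither free nor
# easily reclaimable cache/buffers, as a percentage of total.
# Field names are the standard /proc/meminfo keys.
awk '/^MemTotal:/ {tot=$2}
     /^MemFree:/  {free=$2}
     /^Buffers:/  {buf=$2}
     /^Cached:/   {cache=$2}
     END {
       used = tot - free - cache - buf
       printf "utilized: %.1f%%\n", used * 100 / tot
     }' /proc/meminfo
```

On this host, most of that "used" figure is the 7.2G of slab shown by atop, which is why tuning the slab consumers matters more than the page cache here.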


You need to explain more about your application workload on this box and dig into their memory allocations. Linux runs all kinds of workloads, and the slab allocator has very generic buckets.

If you use cgroups, such as with systemd, use them to look at per service consumption. For example, if chrony is running (for NTP), /sys/fs/cgroup/memory/system.slice/chronyd.service/memory.kmem.slabinfo will contain its slab allocations. Repeat for the top cgroups by memory from the systemd-cgtop command.
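To make that repeatable, here is a sketch that sums slab bytes per systemd service from the cgroup v1 memory controller. The path layout assumes Debian 10 defaults with the memory controller mounted and kernel memory accounting enabled; the slabinfo column layout (name, active_objs, num_objs, objsize, ...) is the standard /proc/slabinfo format:

```shell
#!/bin/sh
# Sum slab usage per systemd service from cgroup v1 kmem accounting.
# Assumes /sys/fs/cgroup/memory is mounted (Debian 10 default layout).
for f in /sys/fs/cgroup/memory/system.slice/*/memory.kmem.slabinfo; do
  [ -r "$f" ] || continue
  svc=$(basename "$(dirname "$f")")
  # Skip the two header lines; bytes = num_objs ($3) * objsize ($4).
  kb=$(awk 'NR > 2 { total += $3 * $4 } END { printf "%d", total / 1024 }' "$f")
  printf '%10s kB  %s\n' "$kb" "$svc"
done | sort -rn | head
```

Anything large that does *not* show up under a service cgroup is being allocated by the kernel on behalf of something else, which is itself a useful clue.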

The second largest slab by usage is 198 thousand task_struct objects. Tasks means processes. Is the 277 from atop representative of how many tasks you run at once? How often are your applications or scripts forking? sighand_cache holds signal-handler state: what is the volume of signals being sent to tasks, and what do their handlers do?
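A quick way to answer the forking question: the `processes` line in /proc/stat is a monotonic counter of tasks created since boot, so sampling it twice gives the fork rate (a minimal sketch; atop's `clones 1515` over the 10-minute interval already suggests a couple of forks per second on this box):

```shell
#!/bin/sh
# Estimate the system-wide fork rate by sampling the monotonic
# "processes" counter in /proc/stat over a fixed interval.
interval=10
p1=$(awk '/^processes/ {print $2}' /proc/stat)
sleep "$interval"
p2=$(awk '/^processes/ {print $2}' /proc/stat)
echo "$(( (p2 - p1) / interval )) forks/second"
```

If the rate is high but the task count stays near 277, you have heavy short-lived process churn (shell-outs, cron jobs, per-message helpers), and the task_struct and sighand_cache slabs are where that churn shows up.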

Profile in detail with tools like Linux perf, ftrace, or BPF. See this past question for ideas about slab analysis.

John Mahowald
  • Mainly I am running Docker with "mailcow" on it. It comes with several container apps (e.g. Postfix, ClamAV); basically it's a mail system. Regarding currently running tasks: (Tasks: 298 total, 1 running, 297 sleeping, 0 stopped, 0 zombie). Is there a way to find out which process is allocating that much SLAB? I googled around but didn't find a way. Could it be a memory leak? Before I upgraded to Debian 10 and migrated to another virtualization host, I had no problem with this setup. I also increased the amount of memory from 10 GiB to 14 GiB when the problem first occurred. – Alex Oct 16 '19 at 15:02