
I have a server with the following spec: dual AMD EPYC 7742, 1TB RAM, 8TB swap (round-robin across an 8x NVMe array), 144TB SSD storage (72 drives over 10 zpools).

uname -a

Linux host 5.4.44-1-pve #1 SMP PVE 5.4.44-1 (Fri, 12 Jun 2020 08:18:46 +0200) x86_64 GNU/Linux

The server has Proxmox 6.2 installed & is up to date as of 2 July 2020. The host runs 1350 LXC containers & maintains a load average of just under 1 at that container count. RAM is at 800GB/1TB & swap is at 1.6TB/7.28TB.

Each container has been built from the Proxmox Ubuntu 18.04 LXC image & they are all almost identical clones of each other. The containers make heavy use of the fast swap array because they only require RAM for a single 60-second computation at boot. Once that completes, under sufficient memory pressure they push almost all of their used RAM into swap & afterwards only read from swap occasionally.
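
(For context, a quick way to confirm this behaviour is to sum the per-process swap footprint from /proc - a rough sketch; VmSwap is reported in kB, and kernel threads simply have no VmSwap line so they drop out:)

# List the largest swap consumers on the host (kB), largest first
for f in /proc/[0-9]*/status; do
  awk '/^Name:/{n=$2} /^VmSwap:/{print $2, n}' "$f"
done | sort -rn | head -20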

Upon creating the 1353rd LXC container, I see a vmalloc allocation failure in the syslog:

Jul 02 20:34:53 host kernel: lxc-start: vmalloc: allocation failure: 4096 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=ns,mems_allowed=0-1
Jul 02 20:34:53 host kernel: CPU: 65 PID: 3438449 Comm: lxc-start Tainted: P           OE     5.4.44-1-pve #1
Jul 02 20:34:53 host kernel: Hardware name: Supermicro Super Server/H11DSi-NT, BIOS 2.0 09/25/2019
Jul 02 20:34:53 host kernel: Call Trace:
Jul 02 20:34:53 host kernel:  dump_stack+0x6d/0x9a
Jul 02 20:34:53 host kernel:  warn_alloc.cold.119+0x7b/0xdd
Jul 02 20:34:53 host kernel:  ? __get_vm_area_node+0x149/0x160
Jul 02 20:34:53 host kernel:  ? bpf_jit_alloc_exec+0xe/0x10
Jul 02 20:34:53 host kernel:  __vmalloc_node_range+0x1aa/0x270
Jul 02 20:34:53 host kernel:  ? bpf_jit_alloc_exec+0xe/0x10
Jul 02 20:34:53 host kernel:  module_alloc+0x82/0xe0
Jul 02 20:34:53 host kernel:  ? bpf_jit_alloc_exec+0xe/0x10
Jul 02 20:34:53 host kernel:  bpf_jit_alloc_exec+0xe/0x10
Jul 02 20:34:53 host kernel:  bpf_jit_binary_alloc+0x63/0xf0
Jul 02 20:34:53 host kernel:  ? emit_mov_reg+0xf0/0xf0
Jul 02 20:34:53 host kernel:  bpf_int_jit_compile+0x133/0x34d
Jul 02 20:34:53 host kernel:  bpf_prog_select_runtime+0xa8/0x130
Jul 02 20:34:53 host kernel:  bpf_prepare_filter+0x52e/0x5a0
Jul 02 20:34:53 host kernel:  bpf_prog_create_from_user+0xc5/0x110
Jul 02 20:34:53 host kernel:  ? hardlockup_detector_perf_cleanup.cold.9+0x1a/0x1a
Jul 02 20:34:53 host kernel:  do_seccomp+0x2bf/0x8d0
Jul 02 20:34:53 host kernel:  __x64_sys_seccomp+0x1a/0x20
Jul 02 20:34:53 host kernel:  do_syscall_64+0x57/0x190
Jul 02 20:34:53 host kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 02 20:34:53 host kernel: RIP: 0033:0x7f29737d6f59
Jul 02 20:34:53 host kernel: Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
Jul 02 20:34:53 host kernel: RSP: 002b:00007ffc72a9e738 EFLAGS: 00000246 ORIG_RAX: 000000000000013d
Jul 02 20:34:53 host kernel: RAX: ffffffffffffffda RBX: 000055d0b17813b0 RCX: 00007f29737d6f59
Jul 02 20:34:53 host kernel: RDX: 000055d0b177fa90 RSI: 0000000000000000 RDI: 0000000000000001
Jul 02 20:34:53 host kernel: RBP: 000055d0b177fa90 R08: 000055d0b17813b0 R09: 000055d0b177ad00
Jul 02 20:34:53 host kernel: R10: 000055d0b178dfd0 R11: 0000000000000246 R12: 00007ffc72a9e7dc
Jul 02 20:34:53 host kernel: R13: 0000000000000000 R14: 00000000ffffffff R15: 000055d0b177ad00
Jul 02 20:34:53 host kernel: Mem-Info:
Jul 02 20:34:53 host kernel: active_anon:57085917 inactive_anon:92502441 isolated_anon:0
 active_file:17684788 inactive_file:8397670 isolated_file:0
 unevictable:167729 dirty:675 writeback:401 unstable:0
 slab_reclaimable:5604171 slab_unreclaimable:27013702
 mapped:5668112 shmem:56359 pagetables:1963891 bounce:0
 free:20376422 free_pcp:131976 free_cma:0
Jul 02 20:34:53 host kernel: Node 0 active_anon:111954916kB inactive_anon:172197032kB active_file:35764692kB inactive_file:17457324kB unevictable:399796kB isolated(anon):0kB isolated(file):0kB mapped:11123132kB dirty:1160kB writeback:644kB shmem:137436kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Jul 02 20:34:53 host kernel: Node 1 active_anon:116388752kB inactive_anon:197812732kB active_file:34974460kB inactive_file:16133356kB unevictable:271120kB isolated(anon):0kB isolated(file):0kB mapped:11549316kB dirty:1540kB writeback:960kB shmem:88000kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Jul 02 20:34:53 host kernel: Node 0 DMA free:15876kB min:0kB low:12kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jul 02 20:34:53 host kernel: lowmem_reserve[]: 0 2561 515798 515798 515798
Jul 02 20:34:53 host kernel: Node 0 DMA32 free:2625288kB min:220kB low:2840kB high:5460kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2732964kB managed:2665112kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:2956kB local_pcp:0kB free_cma:0kB
Jul 02 20:34:53 host kernel: lowmem_reserve[]: 0 0 513236 513236 513236
Jul 02 20:34:53 host kernel: Node 0 Normal free:37110116kB min:44820kB low:570372kB high:1095924kB active_anon:111954916kB inactive_anon:172197032kB active_file:35764692kB inactive_file:17457324kB unevictable:399796kB writepending:1804kB present:533970944kB managed:525553736kB mlocked:399796kB kernel_stack:590520kB pagetables:4130116kB bounce:0kB free_pcp:254676kB local_pcp:1444kB free_cma:0kB
Jul 02 20:34:53 host kernel: lowmem_reserve[]: 0 0 0 0 0
Jul 02 20:34:53 host kernel: Node 1 Normal free:41754408kB min:45064kB low:573476kB high:1101888kB active_anon:116388752kB inactive_anon:197812732kB active_file:34974460kB inactive_file:16133356kB unevictable:271120kB writepending:2500kB present:536866816kB managed:528422152kB mlocked:271120kB kernel_stack:519000kB pagetables:3725448kB bounce:0kB free_pcp:270220kB local_pcp:264kB free_cma:0kB
Jul 02 20:34:53 host kernel: lowmem_reserve[]: 0 0 0 0 0
Jul 02 20:34:53 host kernel: Node 0 DMA: 1*4kB (U) 2*8kB (U) 1*16kB (U) 1*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15876kB
Jul 02 20:34:53 host kernel: Node 0 DMA32: 6*4kB (UM) 6*8kB (M) 8*16kB (M) 6*32kB (M) 6*64kB (M) 6*128kB (M) 5*256kB (UM) 8*512kB (UM) 9*1024kB (UM) 10*2048kB (UM) 632*4096kB (M) = 2625288kB
Jul 02 20:34:53 host kernel: Node 0 Normal: 70110*4kB (UME) 528589*8kB (UME) 278440*16kB (UME) 77872*32kB (UME) 98148*64kB (UM) 34504*128kB (UME) 6830*256kB (UME) 2138*512kB (UME) 722*1024kB (UM) 167*2048kB (UME) 2693*4096kB (UM) = 37109088kB
Jul 02 20:34:53 host kernel: Node 1 Normal: 1440*4kB (UME) 256581*8kB (UM) 92674*16kB (UM) 16683*32kB (UME) 36437*64kB (UM) 6712*128kB (UME) 7106*256kB (UM) 2334*512kB (UM) 2282*1024kB (UME) 609*2048kB (UM) 6809*4096kB (M) = 41753960kB
Jul 02 20:34:53 host kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jul 02 20:34:53 host kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jul 02 20:34:53 host kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jul 02 20:34:53 host kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jul 02 20:34:53 host kernel: 57982225 total pagecache pages
Jul 02 20:34:53 host kernel: 31838959 pages in swap cache
Jul 02 20:34:53 host kernel: Swap cache stats: add 684735119, delete 652970785, find 114714945/164869068
Jul 02 20:34:53 host kernel: Free swap  = 6113127072kB
Jul 02 20:34:53 host kernel: Total swap = 7814100640kB
Jul 02 20:34:53 host kernel: 268396680 pages RAM
Jul 02 20:34:53 host kernel: 0 pages HighMem/MovableOnly
Jul 02 20:34:53 host kernel: 4232461 pages reserved
Jul 02 20:34:53 host kernel: 0 pages cma reserved
Jul 02 20:34:53 host kernel: 0 pages hwpoisoned

I am not able to interpret this output well enough to know which direction to look in. Most similar errors come from old 32-bit kernels, where the issue is resolved by passing vmalloc=512M on the kernel command line via the GRUB boot loader; with 64-bit kernels the vmalloc area is far larger, as evidenced by /proc/meminfo (VmallocTotal ≈ 34TB, VmallocUsed ≈ 24GB):

cat /proc/meminfo

MemTotal:       1056656876 kB
MemFree:        76849680 kB
MemAvailable:   200978380 kB
Buffers:           74844 kB
Cached:         108220668 kB
SwapCached:     128272136 kB
Active:         299102888 kB
Inactive:       407757724 kB
Active(anon):   228172048 kB
Inactive(anon): 370632756 kB
Active(file):   70930840 kB
Inactive(file): 37124968 kB
Unevictable:      675628 kB
Mlocked:          675628 kB
SwapTotal:      7814100640 kB
SwapFree:       6112054688 kB
Dirty:              2500 kB
Writeback:           556 kB
AnonPages:      499566192 kB
Mapped:         22947384 kB
Shmem:            223532 kB
KReclaimable:   22638384 kB
Slab:           131330980 kB
SReclaimable:   22638384 kB
SUnreclaim:     108692596 kB
KernelStack:     1108256 kB
PageTables:      7894616 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    8342429076 kB
Committed_AS:   2407556960 kB
VmallocTotal:   34359738367 kB
VmallocUsed:    23920452 kB
VmallocChunk:          0 kB
Percpu:         25101312 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:    133753764 kB
DirectMap2M:    265598976 kB
DirectMap1G:    674234368 kB

free -m

              total        used        free      shared  buff/cache   available
Mem:        1031891      828580       75310         203      128001      196680
Swap:       7630957     1662143     5968813

Could someone please indicate what specific limitation is implied by the call trace & kernel log, considering that vmalloc should have a much higher limit on a 64-bit system?
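
For reference, live vmalloc allocations can be enumerated per caller from /proc/vmallocinfo, which should show whether bpf_jit_alloc_exec entries dominate as the call trace suggests (a rough sketch - it assumes the stock format where the third field is the caller symbol, and needs root):

grep -c bpf_jit /proc/vmallocinfo                          # count live BPF JIT allocations
awk '{print $3}' /proc/vmallocinfo | sort | uniq -c | sort -rn | head   # top vmalloc consumers by caller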

EDIT (further info): I have followed the LXC production tuning described here: https://linuxcontainers.org/lxd/docs/master/production-setup

sysctl.conf (there is excessive tuning in here, as I encountered many load issues on the way to this point - it turned out my router was spamming out too many Router Advertisements & bringing my server to its knees):

net.ipv4.neigh.default.gc_interval = 3600
net.ipv6.neigh.default.gc_interval = 3600
net.ipv4.neigh.default.gc_stale_time = 3600
net.ipv6.neigh.default.gc_stale_time = 3600
net.ipv4.neigh.default.gc_thresh1 = 80000
net.ipv4.neigh.default.gc_thresh2 = 90000
net.ipv4.neigh.default.gc_thresh3 = 100000
net.ipv6.neigh.default.gc_thresh1 = 80000
net.ipv6.neigh.default.gc_thresh2 = 90000
net.ipv6.neigh.default.gc_thresh3 = 100000
vm.swappiness=100
kernel.keys.maxkeys = 100000000
kernel.keys.maxbytes = 200000000
kernel.dmesg_restrict = 1
vm.max_map_count = 262144
net.ipv6.conf.default.autoconf = 0
fs.inotify.max_queued_events = 167772160
fs.inotify.max_user_instances = 167772160  # def:128
fs.inotify.max_user_watches = 167772160  # def:8192
net.core.bpf_jit_limit = 300000000000
kernel.keys.root_maxbytes = 2000000000
kernel.keys.root_maxkeys = 1000000000
kernel.pid_max = 4194304
kernel.keys.gc_delay = 300
kernel.keys.persistent_keyring_expiry = 259200
fs.aio-max-nr = 524288
kernel.pty.max = 10000
net.core.somaxconn=512000
fs.file-max = 1048576
net.ipv4.ip_local_port_range = 12000 65535
kernel.pty.reserve = 2048
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 1048576 2097152
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_mem = 50576   64768   98152
net.core.netdev_max_backlog = 50000
net.core.netdev_budget = 10000
net.core.netdev_budget_usecs = 2000
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=0
net.ipv4.tcp_fin_timeout=20
kernel.sched_migration_cost_ns = 5000000
kernel.sched_autogroup_enabled = 0
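
(After editing, these can be re-applied to the running host without a reboot:)

sysctl -p          # reload /etc/sysctl.conf
sysctl --system    # or reload every sysctl config directory as well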

/etc/security/limits.conf

  *     soft  nofile      1048576     unset
  *     hard  nofile      1048576     unset
  root  soft  nofile      1048576     unset
  root  hard  nofile      1048576     unset
  *     soft  memlock     unlimited   unset
  *     hard  memlock     unlimited   unset
  root  soft  memlock     unlimited   unset
  root  hard  memlock     unlimited   unset

/etc/modprobe.d/zfs.conf

options zfs zfs_arc_max=103079215104
options zfs l2arc_noprefetch=0
options zfs zfs_arc_dnode_limit_percent=75
options zfs zfs_arc_meta_limit_percent=75
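
These take effect at module load; for experimenting with different ARC caps without a reboot, the live values can also be written via /sys (a sketch, assuming the stock OpenZFS parameter paths):

cat /sys/module/zfs/parameters/zfs_arc_max                    # current cap, bytes
awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats    # live ARC size, bytes
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max     # e.g. cap ARC at 32GiB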

joinsplit
  • Did you really create 8TB of swap and go 1.6TB deep into it?! You may have hit a kernel bug, or you may just need to add more RAM. Not sure if your server will accept more than 1TB though. – Michael Hampton Jul 02 '20 at 20:44
  • Yeah, it is an unusual build, but it has worked well for this very specific use case, where the RAM requirement per container is very short-lived. My server can take up to 4TB RAM, but the cost is extreme. 64GB sticks are the best price point, and the swap enables them to be leveraged well, albeit only in this unique use case. RAM seems OK for now; I don't think there is excessive pressure. I've given the ZFS ARC 96GB, but it works well even when capped to 32GB. – joinsplit Jul 02 '20 at 20:54
  • For what it's worth, my money is on a kernel bug. You'll probably have to ask Proxmox to provide you a later stable (currently 5.7.x) kernel and of course whatever out-of-tree modules they are providing. It looks like they only build longterm kernels though. – Michael Hampton Jul 02 '20 at 20:58
  • Thanks, yeah, I had thought that might be a good idea - I have a thread going on Proxmox under community support, but will move to paid support once I can no longer make progress with it & will hopefully be able to test a newer kernel. Moving from 5.3 to 5.4 solved many issues with the build. – joinsplit Jul 02 '20 at 21:02
  • If there's some way you can build your own custom kernel and have it work with Proxmox, you may wish to spend some time trying that with the latest 5.7.x kernel and see how you get on. I wasn't able to find anything appropriate from a quick search, so this might turn out to be too difficult or time consuming even for an experienced admin. But it's worth a look. – Michael Hampton Jul 02 '20 at 21:14
  • Yep, I had come across something along these lines in this post: https://forum.proxmox.com/threads/compile-proxmox-ve-with-patched-intel-iommu-driver-to-remove-rmrr-check.36374/page-4#post-305534 - I think I'll have to practice on another server & give it a proper attempt when time permits. – joinsplit Jul 02 '20 at 21:25
  • Are the containers privileged or unprivileged? seccomp is in the trace. If unprivileged, in theory that could be disabled and still be fairly secure. I have no idea how feasible hacking out seccomp is, however. – John Mahowald Jul 03 '20 at 02:50
  • The containers are unprivileged - I have had a look at disabling seccomp before, without much success. That was when I encountered the same seccomp error (Unknown Error 524) that resulted from the tunable @kubanczyk alluded to. So his solution is no longer applicable at this stage, but it certainly seems related again (lxc-start throws the same error, but now we see a vmalloc allocation failure as well). I will create a privileged container to test if that works; perhaps that might point me in another direction. – joinsplit Jul 03 '20 at 16:55
  • Privileged containers encounter the same error, so it seems it is not related to constraints around being unprivileged. – joinsplit Jul 04 '20 at 15:31
  • @MichaelHampton I found this repo: https://github.com/fabianishere/pve-edge-kernel. I think it's the Proxmox guys building the latest kernels; I just need guidance on how to install/use them. – joinsplit Jul 05 '20 at 20:00
  • It looks like a directory for building .deb packages. If you've never done this before, it's got a pretty steep learning curve. They really should provide instructions somewhere if they want people to use those. – Michael Hampton Jul 05 '20 at 20:04
  • They seem to have built them: https://github.com/fabianishere/pve-edge-kernel/releases/tag/v5.7.2 - but yeah, judging by the lack of context/instructions, I don't think it's anything they usually support; it's probably just for their own testing. – joinsplit Jul 05 '20 at 20:24
  • Followed these steps: https://forum.proxmox.com/threads/unofficial-proxmox-kernel-4-8-1.30448/ & I'm now on proxmox-ve: 6.2-1 (running kernel: 5.7.2-1-zen2). So let's see how this goes when I hit 1350... – joinsplit Jul 05 '20 at 20:56
  • I need to rebuild apparmor from source against the new kernel headers - I'll ask another question on that before retesting & updating here. – joinsplit Jul 06 '20 at 10:41

1 Answer

The call stack is at bpf_jit_alloc_exec and you have quite a lot of free memory, so there is a good chance you need to look into the new bpf_jit_limit tunable and increase it (it is in bytes, not in pages).
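
A minimal sketch of checking and raising it (the value below is illustrative only, not a recommendation):

sysctl net.core.bpf_jit_limit                # show the current limit, in bytes
sysctl -w net.core.bpf_jit_limit=500000000   # raise it at runtime; persist via /etc/sysctl.conf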

kubanczyk
  • Thanks - I will update my question with my sysctl.conf. I have already raised the bpf_jit_limit tunable to a very large number, as advised in the production setup doc from linuxcontainers.org - that limit was first encountered at around 600 containers. – joinsplit Jul 02 '20 at 20:45
  • I am unsure you really want such a massive number of containers on one host - have you ever thought about a failure? – djdomi Jul 02 '20 at 20:58
  • Yeah, I agree - the risk is magnified by concentrating so many containers on a single host. The new EPYC CPUs present such good value per core, though, and in order to make a more informed decision on the next server (in terms of price/density), I need to find the maximum capability of this build for my use case. – joinsplit Jul 02 '20 at 21:07