Ubuntu 16.04 (64-bit), kernel 4.4.0, 8 CPUs, 31 GB of memory. ZFS is the main filesystem and a CIFS share is mounted.

# sudo numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32157 MB
node 0 free: 2301 MB
node distances:
node   0 
  0:  10 

cat /proc/meminfo | grep -i huge
AnonHugePages:  13080576 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

My server freezes randomly, with the log below in syslog (for the full log, see pastebin). I read this article, which explains this kind of error and a possible resolution, here.

Jan 15 02:35:01 centrallogserver CRON[55892]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan 15 02:36:49 centrallogserver kernel: [120146.673901] java: page allocation failure: order:4, mode:0x240c0c0
Jan 15 02:36:49 centrallogserver kernel: [120146.673908] CPU: 7 PID: 52372 Comm: java Tainted: P           O    4.4.0-112-generic #135-Ubuntu
Jan 15 02:36:49 centrallogserver kernel: [120146.673911] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/19/2018
Jan 15 02:36:49 centrallogserver kernel: [120146.673915]  0000000000000286 d4f0e41d54eb99fa ffff88038a7cb968 ffffffff813fc233
Jan 15 02:36:49 centrallogserver kernel: [120146.673920]  000000000240c0c0 0000000000000000 ffff88038a7cb9f8 ffffffff8119696a
Jan 15 02:36:49 centrallogserver kernel: [120146.673924]  d4f0e41d00000004 0000000000000004 0000000000000040 ffff880284f12a00
Jan 15 02:36:49 centrallogserver kernel: [120146.673929] Call Trace:
Jan 15 02:36:49 centrallogserver kernel: [120146.673938]  [<ffffffff813fc233>] dump_stack+0x63/0x90
Jan 15 02:36:49 centrallogserver kernel: [120146.673945]  [<ffffffff8119696a>] warn_alloc_failed+0xfa/0x150
Jan 15 02:36:49 centrallogserver kernel: [120146.673952]  [<ffffffff8119a14f>] ? __alloc_pages_direct_compact+0x10f/0x130
Jan 15 02:36:49 centrallogserver kernel: [120146.673959]  [<ffffffff8119a5fd>] __alloc_pages_slowpath.constprop.88+0x48d/0xb00
Jan 15 02:36:49 centrallogserver kernel: [120146.673966]  [<ffffffff8119aef6>] __alloc_pages_nodemask+0x286/0x2a0
Jan 15 02:36:49 centrallogserver kernel: [120146.673975]  [<ffffffff811e483c>] alloc_pages_current+0x8c/0x110
Jan 15 02:36:49 centrallogserver kernel: [120146.673980]  [<ffffffff81198ac9>] alloc_kmem_pages+0x19/0x90
Jan 15 02:36:49 centrallogserver kernel: [120146.673986]  [<ffffffff811b63ce>] kmalloc_order_trace+0x2e/0xe0
Jan 15 02:36:49 centrallogserver kernel: [120146.673993]  [<ffffffff811f10ce>] __kmalloc+0x22e/0x250
Jan 15 02:36:49 centrallogserver kernel: [120146.674053]  [<ffffffffc08e5c51>] smb2_unlock_range+0xa1/0x340 [cifs]
Jan 15 02:36:49 centrallogserver kernel: [120146.674094]  [<ffffffffc08daef1>] ? smb2_add_credits+0xb1/0x250 [cifs]
Jan 15 02:36:49 centrallogserver kernel: [120146.674137]  [<ffffffffc08bd600>] cifs_lock+0xc00/0x12a0 [cifs]
Jan 15 02:36:49 centrallogserver kernel: [120146.674142]  [<ffffffff811f048b>] ? __slab_free+0xcb/0x2c0
Jan 15 02:36:49 centrallogserver kernel: [120146.674147]  [<ffffffff811f048b>] ? __slab_free+0xcb/0x2c0
Jan 15 02:36:49 centrallogserver kernel: [120146.674154]  [<ffffffff8139677e>] ? common_file_perm+0x6e/0x1a0
Jan 15 02:36:49 centrallogserver kernel: [120146.674160]  [<ffffffff81266c6e>] vfs_lock_file+0x1e/0x40
Jan 15 02:36:49 centrallogserver kernel: [120146.674164]  [<ffffffff81266f6b>] do_lock_file_wait+0x5b/0x100
Jan 15 02:36:49 centrallogserver kernel: [120146.674170]  [<ffffffff811efc8a>] ? kmem_cache_alloc+0x1ca/0x1f0
Jan 15 02:36:49 centrallogserver kernel: [120146.674174]  [<ffffffff812651bb>] ? locks_alloc_lock+0x1b/0x70
Jan 15 02:36:49 centrallogserver kernel: [120146.674179]  [<ffffffff81268763>] fcntl_setlk+0x133/0x2c0
Jan 15 02:36:49 centrallogserver kernel: [120146.674186]  [<ffffffff812244c2>] SyS_fcntl+0x3e2/0x5e0
Jan 15 02:36:49 centrallogserver kernel: [120146.674193]  [<ffffffff818457ad>] entry_SYSCALL_64_fastpath+0x2b/0xe7
Jan 15 02:36:49 centrallogserver kernel: [120146.674197] Mem-Info:
Jan 15 02:36:49 centrallogserver kernel: [120146.674207] active_anon:3871678 inactive_anon:544913 isolated_anon:0
Jan 15 02:36:49 centrallogserver kernel: [120146.674207]  active_file:181867 inactive_file:199383 isolated_file:0
Jan 15 02:36:49 centrallogserver kernel: [120146.674207]  unevictable:5021 dirty:138 writeback:0 unstable:0
Jan 15 02:36:49 centrallogserver kernel: [120146.674207]  slab_reclaimable:232459 slab_unreclaimable:1851907
Jan 15 02:36:49 centrallogserver kernel: [120146.674207]  mapped:260688 shmem:6003 pagetables:26155 bounce:0
Jan 15 02:36:49 centrallogserver kernel: [120146.674207]  free:179404 free_pcp:283 free_cma:0
Jan 15 02:36:49 centrallogserver kernel: [120146.674217] Node 0 DMA free:15840kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jan 15 02:36:49 centrallogserver kernel: [120146.674230] lowmem_reserve[]: 0 2976 32142 32142 32142
Jan 15 02:36:49 centrallogserver kernel: [120146.674236] Node 0 DMA32 free:164924kB min:12132kB low:15164kB high:18196kB active_anon:465040kB inactive_anon:473384kB active_file:15584kB inactive_file:57088kB unevictable:1204kB isolated(anon):0kB isolated(file):0kB present:3129152kB managed:3048416kB mlocked:1204kB dirty:44kB writeback:0kB mapped:45748kB shmem:2908kB slab_reclaimable:146872kB slab_unreclaimable:1350916kB kernel_stack:6624kB pagetables:9124kB unstable:0kB bounce:0kB free_pcp:704kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 15 02:36:49 centrallogserver kernel: [120146.674249] lowmem_reserve[]: 0 0 29165 29165 29165
Jan 15 02:36:49 centrallogserver kernel: [120146.674255] Node 0 Normal free:536852kB min:118872kB low:148588kB high:178308kB active_anon:15021672kB inactive_anon:1706268kB active_file:711884kB inactive_file:740444kB unevictable:18880kB isolated(anon):0kB isolated(file):0kB present:30408704kB managed:29865212kB mlocked:18880kB dirty:508kB writeback:0kB mapped:997004kB shmem:21104kB slab_reclaimable:782964kB slab_unreclaimable:6056680kB kernel_stack:65472kB pagetables:95496kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 15 02:36:49 centrallogserver kernel: [120146.674267] lowmem_reserve[]: 0 0 0 0 0
Jan 15 02:36:49 centrallogserver kernel: [120146.674273] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15840kB
Jan 15 02:36:49 centrallogserver kernel: [120146.674291] Node 0 DMA32: 504*4kB (UME) 2593*8kB (UME) 2117*16kB (UE) 3322*32kB (UH) 1*64kB (H) 2*128kB (H) 2*256kB (H) 2*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 164792kB
Jan 15 02:36:49 centrallogserver kernel: [120146.674310] Node 0 Normal: 17501*4kB (UEH) 30839*8kB (UMH) 12373*16kB (UMH) 689*32kB (U) 0*64kB 1*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 536860kB
Jan 15 02:36:49 centrallogserver kernel: [120146.674329] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jan 15 02:36:49 centrallogserver kernel: [120146.674333] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 15 02:36:49 centrallogserver kernel: [120146.674335] 408507 total pagecache pages
Jan 15 02:36:49 centrallogserver kernel: [120146.674338] 19222 pages in swap cache
Jan 15 02:36:49 centrallogserver kernel: [120146.674341] Swap cache stats: add 382634, delete 363412, find 121020/166633
Jan 15 02:36:49 centrallogserver kernel: [120146.674344] Free swap  = 3438328kB
Jan 15 02:36:49 centrallogserver kernel: [120146.674346] Total swap = 4194300kB
Jan 15 02:36:49 centrallogserver kernel: [120146.674348] 8388461 pages RAM
Jan 15 02:36:49 centrallogserver kernel: [120146.674351] 0 pages HighMem/MovableOnly
Jan 15 02:36:49 centrallogserver kernel: [120146.674353] 156078 pages reserved
Jan 15 02:36:49 centrallogserver kernel: [120146.674355] 0 pages cma reserved
Jan 15 02:36:49 centrallogserver kernel: [120146.674357] 0 pages hwpoisoned
Jan 15 02:36:49 centrallogserver kernel: [120146.674577] java: page allocation failure: order:4, mode:0x240c0c0
Jan 15 02:36:49 centrallogserver kernel: [120146.674581] CPU: 7 PID: 52372 Comm: java Tainted: P           O    4.4.0-112-generic #135-Ubuntu
Jan 15 02:36:49 centrallogserver kernel: [120146.674585] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/19/2018
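
For reading the trace above: `order:4` means the kernel asked for 2^4 physically contiguous 4 kB pages, i.e. one 64 kB block. The per-zone free lists in the log can be summed per order to see how much memory is free in blocks that large. This is only a minimal sketch: the function names are made up for illustration, and the zone line is copied from the log above.

```python
import re

PAGE_KB = 4  # base page size, matching the "4kB" column in the log


def free_kb_by_order(buddy_line):
    """Map allocation order -> free kB for one 'Node 0 <zone>: ...' log line."""
    free = {}
    for count, size_kb in re.findall(r"(\d+)\*(\d+)kB", buddy_line):
        # A block of size_kb holds 2**order base pages.
        order = (int(size_kb) // PAGE_KB).bit_length() - 1
        free[order] = int(count) * int(size_kb)
    return free


def free_kb_at_or_above(buddy_line, order):
    """Free kB held in blocks of at least the given order."""
    by_order = free_kb_by_order(buddy_line)
    return sum(kb for o, kb in by_order.items() if o >= order)


# "Node 0 Normal" line copied from the log above.
normal = ("Node 0 Normal: 17501*4kB (UEH) 30839*8kB (UMH) 12373*16kB (UMH) "
          "689*32kB (U) 0*64kB 1*128kB (H) 0*256kB 0*512kB 0*1024kB "
          "0*2048kB 0*4096kB = 536860kB")

request_kb = PAGE_KB << 4                       # order:4 request
print(request_kb)                               # 64
print(sum(free_kb_by_order(normal).values()))   # 536860
print(free_kb_at_or_above(normal, 4))           # 128
```

So the Normal zone has about 524 MB free in total, but only a single 128 kB block of order 4 or higher. That block is flagged `(H)`, which as far as I can tell marks a reserve that ordinary allocations cannot use. This suggests fragmentation rather than a lack of free memory.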

As a possible workaround I have increased min_free_kbytes from 60 MB to 256 MB and set vfs_cache_pressure=50. Similarly, I decreased zfs_arc_max and zfs_dirty_data_max to 8 GB and 128 MB respectively, but the problem still persists. Please suggest what system tuning could be done to prevent the freezing. One possible way I see is disabling overcommitting, so that no more memory is allocated than physical RAM.
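
For reference, here is a rough sketch of what strict overcommit (`vm.overcommit_memory=2`) would allow on this machine. It assumes the CommitLimit formula from the kernel's overcommit accounting documentation and the default `vm.overcommit_ratio` of 50. The RAM figure is taken from the numactl output and the swap figure from the log, so the result is approximate; `commit_limit_kb` is a hypothetical helper, not a real API.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50, hugetlb_kb=0):
    """Approximate CommitLimit under vm.overcommit_memory=2.

    CommitLimit = (RAM - hugetlb pages) * overcommit_ratio / 100 + swap
    """
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb


ram_kb = 32157 * 1024   # "node 0 size: 32157 MB" from numactl (assumed ~ MemTotal)
swap_kb = 4194300       # "Total swap = 4194300kB" from the log

limit = commit_limit_kb(ram_kb, swap_kb)
print(limit)                        # 20658684
print(round(limit / 1024**2, 1))    # 19.7 (GiB)
```

With the default ratio, all processes together could commit only about 19.7 GiB, so either the ratio or the swap size would have to be raised. Note also that this limit applies to userspace address space; the failing allocation in the trace is a kernel `__kmalloc`, so strict overcommit may not prevent it.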

sherpaurgen
  • The main thing to figure out is what’s using all your RAM. Have you tried using top or looking at a crash dump to figure that out? If so, posting that in the question would be helpful. – Dan Jan 16 '19 at 17:15
  • Can you share your JAVA OPTS? Did you set the `xmx` and `xms` values for Java? You need to track your processes and see which one is causing this memory error (if any). Memory errors can be very diverse. This is just one thought. – jayooin Jan 16 '19 at 13:45
  • This looks more like a CIFS problem than a ZFS problem, given the `cifs_lock` in the stack trace. The `Free swap = 3438328kB` line also seems to indicate that there's plenty of memory available. And given `Total swap = 4194300kB`, you're going to need a *much* bigger swap partition on a system with 32 GB of RAM if you disable memory overcommit. But for system reliability, [disabling memory overcommit and therefore the OOM killer is probably a good idea](https://lwn.net/Articles/104185/). – Andrew Henle Jan 17 '19 at 10:26
  • @AndrewHenle, by "But for system reliability, disabling memory overcommit and therefore the OOM killer is probably a good idea." are you suggesting that disabling memory overcommit could help here? I didn't quite get the last part. – sherpaurgen Jan 17 '19 at 14:43
  • @satch_boogie The concept of memory overcommit/OOM killer is: "Process asks OS for memory, OS responds 'OK, you can use this memory', process tries to use that memory, OS kills process". Basically, the OS **lies** to the process about the memory available in the **hope** that the process won't actually use that much. If enough of the running processes actually do use enough of the memory the OS said it was OK for them to use, the OS then starts killing processes off. That's **fundamentally** unreliable - if you're running a database system, for example, your database process(es) get killed. – Andrew Henle Jan 17 '19 at 14:57
  • (cont) The "Out Of Fuel" link I posted is a parody of the OOM killer - the OOM killer that kills the database process(es) on a database server is the [Out Of Fuel mechanism tossing the pilot off the plane](https://lwn.net/Articles/104185/). Calling it a crazy idea is too unkind to crazy ideas. If you want reliability, you don't configure a system in a way that it will kill processes necessary for the functioning of that very system. – Andrew Henle Jan 17 '19 at 15:02
  • @AndrewHenle Thanks :) I have some understanding of how the OOM killer works, but I was asking about the part on disabling memory overcommit for system reliability. Since the swap is almost entirely free, I'm not sure why a memory allocation of 64 KB failed. – sherpaurgen Jan 21 '19 at 09:10
