We noticed long JVM pauses during garbage collection in which the user and system times are much smaller than the total (real) time, e.g. [Times: user=3.99 sys=0.55, real=34.29 secs]. We suspected it could be due to memory management and checked the transparent huge pages config, which shows both are disabled:

/sys/kernel/mm/redhat_transparent_hugepage/enabled:always [never]
/sys/kernel/mm/redhat_transparent_hugepage/defrag:[always] never
/sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag:[yes] no

However, looking at the THP and related counters, we see a lot of compaction stalls:

egrep 'trans|thp|compact_' /proc/vmstat

nr_anon_transparent_hugepages 0 
compact_blocks_moved 113682 
compact_pages_moved 3535156 
compact_pagemigrate_failed 0 
compact_stall 1944 
compact_fail 186 
compact_success 1758 
thp_fault_alloc 6 
thp_fault_fallback 0 
thp_collapse_alloc 15 
thp_collapse_alloc_failed 0 
thp_split 17

So the question is: why are the THP and compaction stall/fail counters not 0 if THP is disabled, and how do we disable compaction so that it does not interfere with our JVM (which we believe is the cause of the long GC pauses)? This is happening on RHEL 6.2, kernel 2.6.32-279.5.2.el6.x86_64, JVM 6u21 32-bit. Thanks!

Cyrus
olgg
  • Doesn't this "/sys/kernel/mm/redhat_transparent_hugepage/defrag:[always] never" mean that it *IS* enabled? Wouldn't "always [never]" mean disabled? – Carlos Rendon Sep 17 '13 at 17:16
  • I think I had a similar issue on CentOS. Does turning off transparent huge pages work? – Eric Yung Oct 27 '13 at 23:42

1 Answer

To really get rid of THP, you must make sure that not only the THP daemon is disabled but also THP defrag. Defrag runs independently of the THP daemon; the setting in /sys/kernel/mm/transparent_hugepage/khugepaged/defrag only controls whether the THP daemon (khugepaged) may run defrag as well. That means: even if your applications don't get the (potential) benefit of THP, the defragmentation process that makes your system stall is still active.

It is advisable to use the distribution-independent path for controlling the THP and defrag settings: /sys/kernel/mm/transparent_hugepage/ (which may be a symlink to /sys/kernel/mm/redhat_transparent_hugepage)

This results in:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
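To confirm that the change took effect, you can read the files back; the active value is the one shown in square brackets. A minimal sketch, assuming the files use the bracketed format shown in the question (thp_active_value and thp_report are hypothetical helper names, not part of any tool):

```shell
# Print the active (bracketed) value of a THP sysfs setting read from stdin,
# e.g. "always [never]" -> "never". Prints nothing if no brackets are found.
thp_active_value() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Report the active value of "enabled" and "defrag" under the given directory.
# The default directory is the distribution-independent path from this answer.
thp_report() {
    dir=${1:-/sys/kernel/mm/transparent_hugepage}
    for f in "$dir/enabled" "$dir/defrag"; do
        if [ -r "$f" ]; then
            printf '%s: %s\n' "$f" "$(thp_active_value < "$f")"
        fi
    done
}
```

Calling thp_report without arguments should print "never" for both files once the two echo commands above have been run. Keep in mind that values written with echo do not survive a reboot.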

If you are running a Java application and want to know whether THP/defrag is causing JVM pauses or stalls, it may be worth looking at your GC log. With -XX:+PrintGCDetails enabled, you may observe "real" times that are significantly longer than the sys/user times.

In my case the following one-liner was sufficient:

grep sys=0 gc.log | grep user=0 | grep -P "real=[1-9]"
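The one-liner only catches collections whose user and sys times are both below one second, so it would miss the entry quoted in the question (user=3.99). A rough sketch of a filter that compares the numbers instead; gc_stalls is a hypothetical helper name, and the defaults (real at least 1 second and more than twice user + sys) are arbitrary assumptions that you should tune for your own logs:

```shell
# Print GC log lines whose "real" time is at least min_real seconds and
# more than ratio times the CPU time (user + sys).
# Usage: gc_stalls [file] [min_real] [ratio]   (reads stdin if no file is given)
gc_stalls() {
    file=${1:--}
    awk -v min_real="${2:-1}" -v ratio="${3:-2}" '
        # Return the number following "name=" on the current line, or -1 if absent.
        function field(name) {
            if (match($0, (name "=[0-9.]+")))
                return substr($0, RSTART + length(name) + 1, RLENGTH - length(name) - 1) + 0
            return -1
        }
        /real=/ {
            u = field("user"); s = field("sys"); r = field("real")
            if (u >= 0 && s >= 0 && r >= min_real && r > ratio * (u + s))
                print
        }
    ' "$file"
}
```

For example, gc_stalls gc.log 5 10 would list only collections that took at least 5 seconds of wall-clock time and more than ten times their CPU time.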

The earliest description of the negative effects of THP is afaik this blog post by Greg Rahn: http://structureddata.org/2012/06/18/linux-6-transparent-huge-pages-and-hadoop-workloads/
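One more note on the counters: the values in /proc/vmstat are cumulative since boot, so non-zero numbers alone don't show that compaction is still happening after you changed the settings. What matters is whether they keep growing. A small sketch for comparing two snapshots; vmstat_delta is a hypothetical helper name, and the snapshot paths and the 60-second interval in the usage comment are placeholders:

```shell
# Print how much each compact_* counter grew between two /proc/vmstat snapshots.
# Usage: vmstat_delta BEFORE_FILE AFTER_FILE
vmstat_delta() {
    awk '
        NR == FNR   { before[$1] = $2; next }        # first file: remember old values
        /^compact_/ { print $1, $2 - before[$1] }    # second file: print the growth
    ' "$1" "$2"
}

# Example usage:
#   cat /proc/vmstat > /tmp/vmstat.before
#   sleep 60
#   cat /proc/vmstat > /tmp/vmstat.after
#   vmstat_delta /tmp/vmstat.before /tmp/vmstat.after
```

If compact_stall stays at a delta of 0 while the application is under load, direct compaction is no longer stalling your processes.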

cgicgi
  • I find it interesting that most of the posts I've read about THP don't mention this (and we're definitely seeing it ourselves as well) but yet people note that simply disabling THP is enough to fix their problem (including documentation for various data stores). I wonder if disabling is enough to _mostly_ fix latency issues due to THP, and this just clears it up entirely? – WheresWardy Mar 09 '16 at 09:13
  • As I already stated: memory defragmentation (which causes the system stall) is started by the THP daemon AND the defrag daemon. Disabling THP will probably reduce the number of stalls. In some cases this will be sufficient, but who really wants to stop at a "maybe"? – cgicgi Mar 10 '16 at 16:47
  • True. Interestingly we've also found that we're still seeing latency issues from stalls (the value of compact_stall in /proc/vmstat is still rising) even though we've disabled both THP (as a kernel parameter during boot) and defrag (at runtime), although the issue is nowhere near as bad as before we did these things. We're continuing to investigate it. – WheresWardy Mar 12 '16 at 09:08
  • The compact_stall counter reports kswapd activity as well. – eckes Jan 29 '18 at 18:39