
I have a Java application running on Linux (CentOS 7) with the G1 GC, and it regularly hangs for several seconds. The hangs look just like GC pauses, but I can't find any correspondingly long pauses in the GC log.

To confirm that the Java application really hangs, I started a background thread that does nothing but print a log line every 500 ms. The log indeed pauses for several seconds. Here is the log; note the gap between [14:31:02,834] and [14:31:05,677]:

WARN [2018-07-16 14:30:57,831][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:58,331][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:58,832][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:59,332][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:59,832][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:00,333][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:00,833][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:01,333][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:01,834][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:02,334][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:02,834][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:05,677][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:06,177][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:06,678][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:07,178][clock]py.datanode.DataNodeAppEngine(196):tick...
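The watchdog thread above can be sketched roughly as follows (class and method names are my own invention, not from the original application). It sleeps for the tick interval and flags any wake-up gap far longer than the interval as a suspected whole-JVM pause:

```java
// Sketch of a tick watchdog: logs every 500 ms and flags suspicious gaps.
public class TickWatchdog implements Runnable {
    static final long INTERVAL_MS = 500;

    // Pure helper: a gap of more than twice the sleep interval between
    // consecutive wake-ups suggests the whole JVM was paused.
    static boolean isSuspiciousGap(long prevMs, long nowMs) {
        return (nowMs - prevMs) > 2 * INTERVAL_MS;
    }

    @Override
    public void run() {
        long prev = System.currentTimeMillis();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(INTERVAL_MS);
            } catch (InterruptedException e) {
                return;
            }
            long now = System.currentTimeMillis();
            if (isSuspiciousGap(prev, now)) {
                System.err.printf("tick gap of %d ms, possible JVM pause%n", now - prev);
            } else {
                System.out.println("tick...");
            }
            prev = now;
        }
    }

    public static void main(String[] args) throws Exception {
        Thread t = new Thread(new TickWatchdog(), "clock");
        t.setDaemon(true);
        t.start();
        Thread.sleep(2000); // let it tick a few times, then exit
    }
}
```

One caveat with this technique: it only proves that the logging thread was not scheduled, so a safepoint, heavy swapping, or CPU starvation would all look the same in this log.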

And the GC log at the same time (grepped for "Total time for which application threads were stopped"):

2018-07-16T14:30:58.327+0800: 2679.423: Total time for which application threads were stopped: 0.3750533 seconds, Stopping threads took: 0.0003430 seconds
2018-07-16T14:31:05.176+0800: 2686.272: Total time for which application threads were stopped: 0.5037637 seconds, Stopping threads took: 0.0004556 seconds
2018-07-16T14:31:06.978+0800: 2688.074: Total time for which application threads were stopped: 0.0060367 seconds, Stopping threads took: 0.0003190 seconds

Besides, this Java process also has some threads running native code written in C, which are not influenced by the JVM. Those threads kept running normally; I am quite sure of this because one of them is a heartbeat thread with an 800 ms timeout, and no heartbeat timeout occurred during the pause.

I also monitored the CPU usage: the machine has 12 cores and was up to 80% idle.

Memory usage is not high either, and both THP (transparent huge pages) and swap are disabled.

And I found one thing I can't understand:

There is always a concurrent-mark-start near the pause, and wherever a concurrent-mark-start occurs, there is also a pause.

2018-07-16T14:30:58.489+0800: 2679.586: [GC concurrent-mark-start]

I know the concurrent mark phase doesn't stop the world, but I can't believe this is just a coincidence, because I have reproduced it many times and it always behaves like this.

And here are the CPU usage and memory usage during one of the pauses, from YourKit:

(screenshots: YourKit CPU usage and memory usage charts)

Thanks to @jspcal's suggestion, I got the safepoint statistics:

         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
2566.430: G1IncCollectionPause             [     745          0              0    ]      [     0     0  2705     3   474    ]  0

The G1IncCollectionPause took nearly 3 seconds to reach the safepoint (sync = 2705 ms), while the spin and block times are both 0.

T.Tony
  • If your application is CPU bound you could be starving the logging threads. You should start graphing the heap usage if you really think the application is in deep GC. Maybe take a look at some memory/JVM analyzers https://visualvm.github.io/download.html – flakes Jul 16 '18 at 07:26
  • @flakes I have updated the cpu and memory usage – T.Tony Jul 16 '18 at 07:44

2 Answers

3

While GC is one source of VM pauses, safepoints (stop-the-world pauses) can also be initiated by other operations, such as flushing the code cache, biased-lock revocation, and certain debug operations. Here is a list of safepoint operations. To troubleshoot these safepoints, use these options:

Safepoints:

-XX:+UnlockDiagnosticVMOptions
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=500
-XX:+LogVMOutput
-XX:LogFile=/var/log/jvm/vm.log

GC:

-verbose:gc
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCApplicationConcurrentTime
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails 
-Xloggc:/var/log/jvm/garbage.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=100M
jspcal
  • I got the log; the G1IncCollectionPause took long while spin and block are both 0. How can I tell what is causing such a pause? – T.Tony Jul 16 '18 at 09:26
  • 1
    With `SafepointTimeout` you'll see which threads failed to reach the safepoint in time. – jspcal Jul 16 '18 at 09:30
  • I added "-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=2000", but no timeout was printed. And still, spin and block are 0 while sync is very large (7 seconds). Not only G1IncCollectionPause but any kind of VM operation can take that long. Does this mean the whole process was frozen? – T.Tony Jul 16 '18 at 11:14
3

Finally I found that it's caused by a JDK bug in concurrent marking of large reference arrays:

Large reference arrays cause extremely long synchronization times

And my solution is to split the large reference array into smaller, chunked two-dimensional arrays.
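A minimal sketch of that workaround (the class and chunk size are my own; tune the chunk size to your heap). Instead of one enormous `Object[]`, which concurrent marking must treat as a single unit, the references are spread across many moderately sized chunks:

```java
// Sketch: replace one huge Object[] with fixed-size chunks so that no
// single reference array is large enough to trigger the long-marking bug.
public class ChunkedRefArray<T> {
    private static final int CHUNK_SIZE = 4096; // assumption: pick to taste
    private final Object[][] chunks;
    private final int length;

    public ChunkedRefArray(int length) {
        this.length = length;
        int nChunks = (length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        chunks = new Object[nChunks][];
        for (int i = 0; i < nChunks; i++) {
            // Last chunk may be shorter than CHUNK_SIZE.
            int size = Math.min(CHUNK_SIZE, length - i * CHUNK_SIZE);
            chunks[i] = new Object[size];
        }
    }

    @SuppressWarnings("unchecked")
    public T get(int index) {
        return (T) chunks[index / CHUNK_SIZE][index % CHUNK_SIZE];
    }

    public void set(int index, T value) {
        chunks[index / CHUNK_SIZE][index % CHUNK_SIZE] = value;
    }

    public int length() {
        return length;
    }
}
```

The extra division and modulo per access are cheap compared with a multi-second pause, and the outer array holds only a few hundred references, so it is no longer "large" from the collector's point of view.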

T.Tony