I have a java application running on Linux (CentOS 7) using g1 gc, and it regularly hangs for several seconds, it looks just like gc pauses, but I can't find such long pauses in gc log.
To make sure that the java application hangs, I started a background thread who dose nothing else but print out a log every 500ms. And the log was found paused for several seconds. Here is the log and it paused at [14:31:02,834] to [14:31:05,677]
WARN [2018-07-16 14:30:57,831][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:58,331][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:58,832][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:59,332][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:30:59,832][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:00,333][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:00,833][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:01,333][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:01,834][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:02,334][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:02,834][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:05,677][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:06,177][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:06,678][clock]py.datanode.DataNodeAppEngine(196):tick...
WARN [2018-07-16 14:31:07,178][clock]py.datanode.DataNodeAppEngine(196):tick...
And the gc log at the same time (grep with Total time for which application threads were stopped) :
2018-07-16T14:30:58.327+0800: 2679.423: Total time for which application threads were stopped: 0.3750533 seconds, Stopping threads took: 0.0003430 seconds
2018-07-16T14:31:05.176+0800: 2686.272: Total time for which application threads were stopped: 0.5037637 seconds, Stopping threads took: 0.0004556 seconds
2018-07-16T14:31:06.978+0800: 2688.074: Total time for which application threads were stopped: 0.0060367 seconds, Stopping threads took: 0.0003190 seconds
Besides, this java process also have some threads running native codes, they are written in C, and are not influenced by jvm. Those threads were running well, I am quite sure about this is because one of these threads are heartbeat threads and the heartbeat time out is 800MS, but no heartbeat timeout was found during the pause.
I also monitored the cpu usage, and the cpu with 12 cores was up to 80% idle.
Memory usage is not too high neither, THP(transparent huge pages) and the swap memory were also disabled.
And I found one thing I can't understand:
There is always a concurrent-mark-start near the pause, and wherever a concurrent-mark-start occurs, there is also a pause.
2018-07-16T14:30:58.489+0800: 2679.586: [GC concurrent-mark-start]
I know the concurrent mark phase won't cause STW, but I can't believe this is just an coincidence cause I reproduced this for a lot of times and it's always act like this.
And here are the CPU usage and memory usage during one of the pauses from YourKit :
Thanks to @jspcal's suggestion, I got the SafepointStatistics :
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
2566.430: G1IncCollectionPause [ 745 0 0 ] [ 0 0 2705 3 474 ] 0
The G1IncCollectionPause took nearly 3 seconds to get to a safe point, while the spin and the block time are both 0