5

I detected on a server long safepoints (>10sec!) in jvm safepoint.log:

6534.953: no vm operation                  [     353          0              4    ]      [     0     0 14179     0     0    ]  0
7241.410: RevokeBias                       [     357          0              1    ]      [     0     0 14621     0     0    ]  0
8501.278: BulkRevokeBias                   [     356          0              6    ]      [     0     0 13440     0     2    ]  0
9667.681: no vm operation                  [     349          0              8    ]      [     0     0 15236     0     0    ]  0
12094.170: G1IncCollectionPause             [     350          0              4    ]      [     0     0 15144     1    24    ]  0
13383.412: no vm operation                  [     348          0              4    ]      [     0     0 15783     0     0    ]  0
13444.109: RevokeBias                       [     349          0              2    ]      [     0     0 16084     0     0    ]  0

On my laptop I've played with -XX:SafepointTimeoutDelay=2 and it works good, printing bad threads:

# SafepointSynchronize::begin: Timeout detected: 
...
# SafepointSynchronize::begin: (End of list)
<writer thread='11267'/>
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
567.766: BulkRevokeBias                   [      78          1              2    ]      [     0     6     6     0     0    ]  0

So I've added to options to the server: -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=1000 to see which threads cause the problem, but I don't see any prints, while I still see long safepoint times.

Why doesn't it applied on the server?

Here is the actual server config (taken from safepoint.log):

Java HotSpot(TM) 64-Bit Server VM (25.202-b08) for linux-amd64 JRE (1.8.0_202-b08), built on Dec 15 2018 12:40:22 by &quot;java_re&quot; with gcc 7.3.0
...
-XX:+PrintSafepointStatistics 
-XX:PrintSafepointStatisticsCount=10
-XX:+UnlockDiagnosticVMOptions 
-XX:+LogVMOutput
-XX:LogFile=/opt/pprb/card-pro/pci-pprb-eip57/logs/safepoint.log
-XX:+SafepointTimeout 
-XX:SafepointTimeoutDelay=1000
...
  • You restarted the VM, right? – Chris Neve Mar 27 '19 at 09:54
  • Maybe try adding -XX:+PrintGCApplicationStoppedTime to see if you get any more info? – Chris Neve Mar 27 '19 at 09:59
  • Chris, yes, restarted. – Sergey Kosarev Mar 27 '19 at 10:15
  • We already have `-XX:+PrintGCApplicationStoppedTime`. In GC logs I see the same problems: `2019-03-25T12:42:57.678+0300: 6777.056: Total time for which application threads were stopped: 18.0049752 seconds, Stopping threads took: 18.0036770 seconds` – Sergey Kosarev Mar 27 '19 at 10:20
  • It's not clear from the question, if no safepoint statistics is printed at all, or if the safepoint statistics printed with empty thread list. Please paste the logs. – apangin Mar 27 '19 at 10:54
  • @apangin, hi, only usual safepoint staistics is printed `13383.412: no vm operation [ 348 0 4 ] [ 0 0 15783 0 0 ] 0`, but no any line with 'Timeout detected is printed. I'll share full log a bit later. – Sergey Kosarev Mar 27 '19 at 11:30
  • @apangin, here is full [safepoint.log](https://drive.google.com/file/d/1bfY568eT6wA2y3Jfgh1rODCBijF958n7/view?usp=sharing) – Sergey Kosarev Mar 27 '19 at 11:35
  • Seems like the process did not receive CPU time, and this is not because of particular thread(s), but rather due to some external problem on OS level. Swapping or direct reclaim are the typical reasons. – apangin Mar 28 '19 at 18:40
  • @apangin, how could I check your assumption? Regarding to swapping, I see /proc/sys/vm/swappines=0 on the server. – Sergey Kosarev Mar 29 '19 at 08:43
  • 1
    Create a native thread which wakes for example each 100 ms and check if it also experiences long pauses. In any case, always monitor OS memory usage, I/O utilization etc. – apangin Mar 31 '19 at 16:45
  • BTW, swappines=0 does not prevent from swapping. Start from looking at `/proc/meminfo` – apangin Mar 31 '19 at 16:47

1 Answers1

0

In safepoint, "Total time for which application threads were stopped: 18.0049752 seconds, Stopping threads took: 18.0036770 seconds" maybe caused by a thread wait for lock, and maybe not. When SafepointTimeoutDelay=1000, if more than one thread wait for 1s, there will invoke SafepointSynchronize::print_safepoint_timeout method in safepoint.cpp to print certain ThreadSafepointState. But when all thread come to safepoint and other reason to hold in 18s, the method will be not called and no logs for it.

We can set safepoint=trace in jdk9+ to know all thread state in gc log.

kimmking
  • 1
  • 3