
Background

  • In our Presto service, we found that real time was longer than the sum of user and sys times. For details, see the previous question: which is G1 young STW time?
  • The Presto service runs in a k8s pod, and we could not find the root cause. We suspected the service might lack CPU, so we increased the memory quota and modified the JVM config to print STW times and safepoint statistics:
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintSafepointStatistics 
-XX:PrintSafepointStatisticsCount=1
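
(Side note, in case the service is ever moved to JDK 9+: these JDK 8 flags are replaced there by unified logging, and something like the following should print roughly equivalent safepoint/pause information; the exact detail level differs, so treat this as a sketch.)
-Xlog:safepoint
-Xlog:gc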

New Problem

  • After that, the Presto service encountered many pauses, and the STW records are as follows:
2022-11-10T17:23:14.851+0800: 7689.689: Total time for which application threads were stopped: 0.0026007 seconds, Stopping threads took: 0.0002632 seconds
2022-11-10T17:23:40.160+0800: 7714.999: Total time for which application threads were stopped: 21.8407322 seconds, Stopping threads took: 0.0002557 seconds
2022-11-10T17:23:40.164+0800: 7715.002: Total time for which application threads were stopped: 0.0025454 seconds, Stopping threads took: 0.0004116 seconds
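
For anyone who wants to reproduce this, a quick way to pull the long pauses out of the `-XX:+PrintGCApplicationStoppedTime` output is to parse the "Total time" lines. A minimal sketch (the sample lines are copied from the log above; the 1-second threshold is an arbitrary choice):

```python
import re

# Sample GC log lines, copied from the question above.
log = """\
2022-11-10T17:23:14.851+0800: 7689.689: Total time for which application threads were stopped: 0.0026007 seconds, Stopping threads took: 0.0002632 seconds
2022-11-10T17:23:40.160+0800: 7714.999: Total time for which application threads were stopped: 21.8407322 seconds, Stopping threads took: 0.0002557 seconds
2022-11-10T17:23:40.164+0800: 7715.002: Total time for which application threads were stopped: 0.0025454 seconds, Stopping threads took: 0.0004116 seconds
"""

# "stopped" is the full pause; "Stopping threads" is the time-to-safepoint part.
PAUSE_RE = re.compile(r"stopped: ([\d.]+) seconds, Stopping threads took: ([\d.]+) seconds")

for line in log.splitlines():
    m = PAUSE_RE.search(line)
    if m:
        stopped, stopping = float(m.group(1)), float(m.group(2))
        if stopped > 1.0:  # only report pauses longer than 1 second
            print(f"long pause: {stopped:.3f}s (time-to-safepoint: {stopping:.4f}s)")
# → long pause: 21.841s (time-to-safepoint: 0.0003s)
```

Note that in the 21.8-second pause the time-to-safepoint is only ~0.3 ms, so almost all of the time is spent inside the stop itself, not reaching it.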
  • Related safepoint statistics are as follows:
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
7693.158: RevokeBias                       [     868          0              0    ]      [     0     0     0     2     0    ]  0
  • The time for each phase in the safepoint is extremely short, almost zero, yet the service stopped for 21 seconds.
  • Has anyone else encountered the same problem? Any help would be appreciated.

Lack of CPU, or a system problem? I have no idea.

sunrise

0 Answers