Background
- In our Presto service, we found that the real time was longer than the sum of the user and sys times. For details, refer to the previous question: "which is G1 young STW time?"
- The Presto service runs in a k8s pod, and we have not found the root cause. We suspect the service may be short of CPU, so we increased the memory quota and modified the JVM config to print STW time and safepoint statistics:
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
New Problem
- After that, the Presto service encountered many pauses, and the STW records are as follows:
2022-11-10T17:23:14.851+0800: 7689.689: Total time for which application threads were stopped: 0.0026007 seconds, Stopping threads took: 0.0002632 seconds
2022-11-10T17:23:40.160+0800: 7714.999: Total time for which application threads were stopped: 21.8407322 seconds, Stopping threads took: 0.0002557 seconds
2022-11-10T17:23:40.164+0800: 7715.002: Total time for which application threads were stopped: 0.0025454 seconds, Stopping threads took: 0.0004116 seconds
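For context, these long stops can be located by scanning the GC log for the "Total time for which application threads were stopped" entries. Below is a minimal sketch of such a scan; the log path and the 1-second threshold are assumptions, and the line format follows the records above.

```python
import re
import sys

# Matches lines like:
# 2022-11-10T17:23:40.160+0800: 7714.999: Total time for which application threads
# were stopped: 21.8407322 seconds, Stopping threads took: 0.0002557 seconds
STOPPED_RE = re.compile(
    r"^(?P<ts>\S+): (?P<uptime>[\d.]+): Total time for which application threads "
    r"were stopped: (?P<stopped>[\d.]+) seconds, "
    r"Stopping threads took: (?P<stopping>[\d.]+) seconds"
)

def long_stops(log_path, threshold_secs=1.0):
    """Yield (timestamp, stopped, stopping) for every pause longer than the threshold."""
    with open(log_path) as f:
        for line in f:
            m = STOPPED_RE.match(line.strip())
            if m and float(m.group("stopped")) >= threshold_secs:
                yield m.group("ts"), float(m.group("stopped")), float(m.group("stopping"))

if __name__ == "__main__":
    for ts, stopped, stopping in long_stops(sys.argv[1]):
        print(f"{ts} stopped={stopped:.4f}s stopping_threads={stopping:.4f}s")
```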
- The related safepoint statistics are as follows:
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
7693.158: RevokeBias [ 868 0 0 ] [ 0 0 0 2 0 ] 0
- The time for each safepoint phase is extremely short, almost zero, yet the service was stopped for about 21.8 seconds.
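To make the discrepancy concrete, here is a small sketch that sums the per-phase times from the statistics line above. The column layout follows the printed header, and the time columns are in milliseconds; the line is hard-coded here purely for illustration.

```python
# Safepoint statistics line copied from the log above; the bracketed groups follow the header:
# [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop]
line = "7693.158: RevokeBias [ 868 0 0 ] [ 0 0 0 2 0 ] 0"

# Pull out the second bracketed group, which holds the per-phase times in milliseconds.
times_ms = [int(x) for x in line.split("[")[2].split("]")[0].split()]
spin, block, sync, cleanup, vmop = times_ms

print(f"sum of safepoint phases: {sum(times_ms)} ms")   # 2 ms
print(f"application stopped for: {21.8407322:.1f} s")   # ~21.8 s from the STW record
```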
- Has anyone else encountered the same problem? Is it a lack of CPU, or a system-level problem? I have no idea and am waiting for help.