I have a web service which receives requests and after performing certain validations dumps data to a Message Queue. This service creates a ThreadPoolExecutor internally. I use Prometheus and Grafana to visualize application metrics and observed that CPU usage is always increasing and never reducing even during low workload (even if I detach the application from Load Balancer and serve no traffic).
Earlier I was running application using Jetty 9.0.6 and am observing this behavior after switching to Jetty 9.4.22. Reverting back to Jetty 9.0.6 (or even 9.4.13) seems to resolve this issue. So I suspect this might be related to Jetty (9.4.13+).
So I tried to analyze top command output and connect to application using VisualVM at the same time.
Linux top command reported that a certain thread from the ThreadPoolExecutor was using high CPU with CPU time over 500 minutes, however Visual VM showed that the thread is in waiting state.
top
command output (Few threads report high CPU usage for prolonged duration like following when they're actually in WAITING state)
10641 app 20 0 30.4g 21g 112m S 47.9 70.2 657:12.56 657:12 pool-5-thread-1
10914 app 20 0 30.4g 21g 112m R 33.7 70.2 508:25.46 508:25 pool-5-thread-1
10009 app 20 0 30.4g 21g 112m R 33.7 70.2 454:39.50 454:39 pool-5-thread-5
10220 app 20 0 30.4g 21g 112m S 33.7 70.2 444:55.14 444:55 pool-5-thread-7
12230 app 20 0 30.4g 21g 112m R 32.8 70.2 438:18.02 438:18 pool-5-thread-2
The thread state from Thread Dump
pool-5-thread-1
priority:5 - threadId:0x00007f13b0033000 - nativeId:0x8c9 - nativeId (decimal):2249 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000037ca83da0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
pool-5-thread-5
priority:5 - threadId:0x00007f13a0020800 - nativeId:0x8cd - nativeId (decimal):2253 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000037ca83da0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
pool-5-thread-7
priority:5 - threadId:0x00007f139c021800 - nativeId:0x8cf - nativeId (decimal):2255 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000037ca83da0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
pool-5-thread-2
priority:5 - threadId:0x00007f13bc021800 - nativeId:0x8ca - nativeId (decimal):2250 - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000037ca83da0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I need help understanding if this could be related to switching from Oracle JDK 8u65
to Amazon Corretto JDK 8u232
CPU usage as monitored in EC2 console
Update: Full top output (from another occurrence) -
top - 12:16:34 up 5:03, 0 users, load average: 10.64, 9.74, 9.72
Threads: 7123 total, 10 running, 7113 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.3 us, 16.0 sy, 0.0 ni, 54.8 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
KiB Mem: 32121548 total, 26088752 used, 6032796 free, 134028 buffers
KiB Swap: 0 total, 0 used, 0 free. 2008744 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5660 app 20 0 30.582g 0.021t 97556 S 47.0 69.6 128:42.34 pool-5-thread-3
4804 app 20 0 30.582g 0.021t 97556 R 49.6 69.6 98:41.07 pool-5-thread-1
2695 app 20 0 30.582g 0.021t 97556 R 49.9 69.6 93:10.97 pool-5-thread-5
2392 app 20 0 30.582g 0.021t 97556 S 48.0 69.6 79:47.46 pool-5-thread-9
2284 app 20 0 30.582g 0.021t 97556 S 3.2 69.6 13:04.52 pool-5-thread-2
JVM flags being used -
-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=70
-Xss228k
-XX:G1HeapRegionSize=4
-XX:MaxGCPauseMillis=200
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-XX:NativeMemoryTracking=detail
-XX:+PrintTenuringDistribution
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=20
-XX:GCLogFileSize=512M
-Xloggc:/home/app/logs/gc/${CURRENT_DATETIME}.log
-Xms100m
-Xmx${JAVA_XMX}m
-XX:ActiveProcessorCount=${CPUS}
-XX:MaxMetaspaceSize=256m
-XX:CompressedClassSpaceSize=200m
-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.port=6001
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.rmi.port=6001
-Djava.rmi.server.hostname=${HOST}
-XX:+UsePerfData