Understanding high system CPU usage on web server on Windows

Question

NOTE We have a couple of close votes for this post. I am not sure exactly why since no one has commented but this question is about how our custom written Java server and its calls into socket reads can interact with the Windows OS and produce this high system CPU usage, therefore I believe this is the correct forum to post this question rather than a sysadmin area.

I have a custom Java server that is serving TCP connections of a couple of different types (some web, some raw) and one of our customers is having issues with occasional periods of high CPU + high system CPU.

We may have reproduced this on our test server which is running on Windows Server 2008 on vmware.

During a failover from one server to another or during other network events that cause a number of clients (40-80) to re-establish their connections simultaneously we see high CPU usage including very high system CPU (>50% average with spikes of over 70%). These clients typically maintain TCP sockets to the server so during the failover we would see perhaps 80-160 sockets being reconnected and serviced.

We have used kernrate to profile during one of these periods and are having a hard time understanding the results.

Kernrate reports low idle time and high kernel time as we would expect during this period:

P0     K 0:00:39.405 (55.9%)  U 0:00:27.424 (38.9%)  I 0:00:03.603 ( 5.1%)  DPC
0:00:01.528 ( 2.2%)  Interrupt 0:00:00.327 ( 0.5%)
       Interrupts= 114880, Interrupt Rate= 1631/sec.

P1     K 0:00:37.596 (53.4%)  U 0:00:29.281 (41.6%)  I 0:00:03.556 ( 5.0%)  DPC
0:00:00.078 ( 0.1%)  Interrupt 0:00:00.624 ( 0.9%)
       Interrupts= 96016, Interrupt Rate= 1363/sec.

TOTAL  K 0:01:17.002 (54.7%)  U 0:00:56.706 (40.3%)  I 0:00:07.160 ( 5.1%)  DPC
0:00:01.606 ( 1.1%)  Interrupt 0:00:00.951 ( 0.7%)
       Total Interrupts= 210896, Total Interrupt Rate= 2994/sec.

Total Profile Time = 70434 msec

There is plenty of physical ram available to both the host and the guest and perfmon reports page faults of from zero to 200/sec with maybe a couple of spikes during the 70s period of 800/sec. The Java heap is set to 3GB which is doesn't come close to the edge of during the problem period (or at all). We don't believe it is doing garbage collection during this time.

                               BytesStart          BytesStop         BytesDiff.
Available Physical Memory   ,      6296776704,      6266650624,       -30126080
Available Pagefile(s)       ,     12591370240,     12577845248,       -13524992
Available Virtual           ,   8796052869120,   8796048171008,        -4698112
Available Extended Virtual  ,               0,               0,  0
Committed Memory Bytes      ,       305979392,       303964160,        -2015232
Non Paged Pool Usage Bytes  ,        81997824,        82333696,          335872
Paged Pool Usage Bytes      ,       164028416,       164057088,           28672
Paged Pool Available Bytes  ,      4147101696,      4149116928,         2015232
Free System PTEs            ,        33556099,        33556099,  0

We have made some efforts to consolidate I/O calls and increase the average bytes per I/O operation. Generally though they don't seem too unreasonable:

                              Total      Avg. Rate
Context Switches     ,       462496,         6566/sec.
System Calls         ,       835593,         11863/sec.
Page Faults          ,        32814,         466/sec.
I/O Read Operations  ,          747,         11/sec.
I/O Write Operations ,         3792,         54/sec.
I/O Other Operations ,        27565,         391/sec.
I/O Read Bytes       ,       382146,         512/ I/O
I/O Write Bytes      ,       684128,         180/ I/O
I/O Other Bytes      ,       890365,         32/ I/O

Kernrate reports the vast majority of time spent in the NTOSKRNL module and a rate of almost 10 million events per second:

OutputResults: KernelModuleCount = 131
Percentage in the following table is based on the Total Hits for the Kernel
ProfileTime   12806 hits, 65536 events per hit --------
 Module                                Hits   msec  %Total  Events/Sec
NTOSKRNL                              10561      70429    82 %     9827282
AMDPPM                                 1001      70429     7 %      931456
HAL                                     701      70429     5 %      652298
VMXNET3N61X64                           351      70428     2 %      326619
WIN32K                                   68      70428     0 %       63276
TCPIP                                    35      70428     0 %       32568
AFD                                      19      70428     0 %       17680
NETIO                                    17      70429     0 %       15818
VM3DMP                                   11      70428     0 %       10235
NTFS                                      6      70429     0 %        5583
NDIS                                      6      70429     0 %        5583
LSI_SAS                                   6      70429     0 %        5583
FLTMGR                                    6      70429     0 %        5583
VMMOUSE                                   5      70429     0 %        4652
I8042PRT                                  5      70429     0 %        4652
VMCI                                      3      70429     0 %        2791
CDD                                       2      70428     0 %        1861
DXGMMS1                                   1      70428     0 %         930
VMHGFS                                    1      70429     0 %         930
NSIPROXY                                  1      70428     0 %         930

With the breakdown in the NTOSKRNL module being as follows:

ProfileTime   10561 hits, 65536 events per hit --------
 Module                                Hits   msec  %Total  Events/Sec
KeSynchronizeExecution                 2485      70429    23 %     2312356
ExReleaseRundownProtectionCacheAwareEx       2099      70429    19 %     1953173

IoGetPagingIoPriority                  1561      70429    14 %     1452550
memmove                                1006      70429     9 %      936108
RtlDelete                               952      70429     9 %      885860
memset                                  448      70429     4 %      416875
ExfAcquirePushLockExclusive             326      70429     3 %      303351
PoStartNextPowerIrp                     280      70429     2 %      260547
RtlFindClearBits                        239      70429     2 %      222395
KeBugCheckEx                            199      70429     1 %      185174
KeWaitForMultipleObjects                 92      70429     0 %       85608
FsRtlTeardownPerFileContexts             67      70429     0 %       62345
KeSetTimer                               62      70429     0 %       57692
NtWaitForSingleObject                    59      70429     0 %       54901
wcsncat_s                                58      70429     0 %       53970

What I don't understand here is how what seems like a reasonable number of I/O operations per second (<500/sec if I understand kernrate properly?) translates into millions of events for KeSynchronizeExecution, ExReleaseRundownProtectionCacheAwareEx and IoGetPagingIoPriority, which are presumably the source of the high kernel CPU?

I have searched for more information on these functions but information is thin on the ground about what they do and what can translate into high CPU usage regarding them.

Not really a good idea, but maybe you want to look into "full scale" profiling; for example by completely instrumenting all your code so that you **really** understand how much time your code is spending where. — GhostCat, Jul 06 '16 at 12:01
We have run jconsole with top threads during this time and what we see is that the threads are in Socket.read - reading from the socket into a provided byte array buffer, for some reason those socket reads are consuming large amounts of kernel CPU so this doesn't seem to be an issue of determining why our code is doing a lot of CPU cycles (which would result in high user CPU and low kernel CPU) — AntonyM, Jul 06 '16 at 12:04
Can you use JProfiler to profile your code. That should tell you about any spikes in terms of threads(CPU) or memory(RAM) coming from your application. I believe it might even tell you how much time is spent in exactly which class and method. — Vishal Jumani, Jul 06 '16 at 12:19
I believe jconsole in combination with the top threads plugin has already identified that the threads are in socket.read (through a BufferedInputStream) during the high cpu periods which I believe also correlates with the kernrate results. The problem we seem to face is one of high cpu usage somewhere in the kernel during those method calls. This would not be in our code which would register as user mode cpu usage but must be somehow being provoked by our code or the network hence the more detailed breakdown from kernrate of the NTOSKRNL module. — AntonyM, Jul 06 '16 at 15:22
You may have close votes because you haven't given any code, and unless you're asking about a problem with your code (i.e. a [MCVE]), this question is off-topic. — Nic, Jul 06 '16 at 18:47
Why is an example necessary in this case given the detailed information about the specific API call and its effects on the OS and the kernel? The question is about a particular API call and help in interpreting the effects of said API call based on the kernrate trace. I don't see how example code would contribute at this point? Either someone knows how to interpret kernrate in depth or not? I could post on superuser but I would expect more guidance on diagnosing a problem app or driver than understanding the underlying CPU usage based on API call? Where would be a better place to post? — AntonyM, Jul 06 '16 at 21:38
How do you handle incomming connection requests? Are you handle each TCP connection inside a thread? — user743414, Jul 07 '16 at 07:19
So you're using one thread per connection? If so then this is your problem and the reason why your kernel is busy. He's switching between all your threads. — user743414, Jul 08 '16 at 11:05
Even with a bunch of threads to switch between shouldn't the kernel be able to do more than a few hundred IO operations per second without going to 100% cpu? Why does so few IO operations per second translate into 2 million KeSynchronizeExecution requests per second? — AntonyM, Jul 08 '16 at 11:50
I do not think that your I/O is the problem. The problem is that your kernel has to switch between your threads a lot. https://en.wikipedia.org/wiki/Context_switch So one thread per connection is a bad design. But without code which reproduces your problem it's not possible to help you further. — user743414, Jul 11 '16 at 07:45
Perfmon reports context switches as low during the hang so I am confident that is not the problem. The server does both blocking and non blocking io and one thread per tcp socket is not always bad design depending on how the socket is used. On a frequently used socket a pool can create restrictions and constant thread creates can be inefficient. — AntonyM, Jul 11 '16 at 08:52
Without more info, how you do things it's just guessing. Maybe you have a general locking issue inside your program? Who knows? — user743414, Jul 11 '16 at 13:59
It took a long time to work through this but I believe (some) others will surely be having the same issue. The long version is at http://www.jwrapper.com/blog/high-privileged-time-and-exceptions-important-java-jvm-options-you-should-consider-if-running-a-server. The short version is that this is a bug in the OpenJDK pseudo memory barrier optimisation which is causing one or more threads to generate OS exceptions in a pauseless loop, starving the entire OS, making profiling impossible. It is not related to any specific app code but is made more likely under load. -XX:+UseMembar fixes it. — AntonyM, Aug 24 '16 at 23:43
I would also like to add my opinion that I think this question has been incorrectly closed. Issues caused in a Java program by an OpenJDK bug are clearly programming related and with some issues (like this one) a major part of resolving the issue is in working towards a minimal verifiable complete example, but people with appropriate knowledge and experience (e.g. in depth knowledge of the Windows kernel, driver architecture and knowledge of the probable usecases for specific kernel functions in this case) can still potentially help in that process. — AntonyM, Aug 24 '16 at 23:50
[Process Explorer](https://learn.microsoft.com/en-us/sysinternals/downloads/process-explorer) is a powerful tool in debug windows performance issue. It can check deadlock thread or memory issue. — Steven Chou, Mar 26 '18 at 03:52

Understanding high system CPU usage on web server on Windows

0 Answers0