WebLogic server seems to freeze after CMS Garbage Collection run

Question

We're using Java Hotspot 1.6 with a 10.3.6 WebLogic Cluster with 2 nodes running OSB 11.1.1.7 behind a physical load balancer:

java version "1.6.0_91"
Java(TM) SE Runtime Environment (build 1.6.0_91-b13)
Java HotSpot(TM) 64-Bit Server VM (build 20.91-b07, mixed mode)

We've tuned the JVM to use CMS and avoid long running garbage collections:

-XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:NewRatio=4 -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60 -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark

Most of the pause phases of the GCs (initial mark and remark) take less than a second, we don't see any failed CMS runs and we don't see Full GCs in the gc logs.

However, we're receiving timeout errors on clients trying to connect to OSB a few seconds after a completed CMS GC execution (particularly in an Oracle SOA installation with lots of composites that call OSB):

OSB 1 gc.log:

> 2016-07-28T08:31:52.656-0300: 492247.689: [GC
> 2016-07-28T08:31:52.656-0300: 492247.690: [ParNew:
> 1027265K->22014K(1132416K), 0.0412613 secs]
> 4046262K->3043322K(6165632K), 0.0420791 secs] [Times: user=0.25
> sys=0.00, real=0.04 secs]  2016-07-28T08:31:52.701-0300: 492247.735:
> [GC [1 CMS-initial-mark: 3021307K(5033216K)] 3043844K(6165632K),
> 0.0683781 secs] [Times: user=0.08 sys=0.00, real=0.07 secs]  2016-07-28T08:31:52.770-0300: 492247.804: [CMS-concurrent-mark-start]
> 2016-07-28T08:31:56.971-0300: 492252.005: [CMS-concurrent-mark:
> 4.201/4.201 secs] [Times: user=10.55 sys=0.16, real=4.20 secs]  2016-07-28T08:31:56.971-0300: 492252.005:
> [CMS-concurrent-preclean-start] 2016-07-28T08:31:57.064-0300:
> 492252.098: [CMS-concurrent-preclean: 0.090/0.093 secs] [Times: user=0.12 sys=0.00, real=0.09 secs]  2016-07-28T08:31:57.064-0300:
> 492252.098: [CMS-concurrent-abortable-preclean-start] 2016-07-28T08:32:01.198-0300: 492256.231: [GC
> 2016-07-28T08:32:01.198-0300: 492256.232: [ParNew:
> 1028128K->30711K(1132416K), 0.0502326 secs]
> 4049435K->3053383K(6165632K), 0.0508313 secs] [Times: user=0.34
> sys=0.00, real=0.05 secs] CMS: abort preclean due to time
> 2016-07-28T08:32:02.128-0300: 492257.161:
> [CMS-concurrent-abortable-preclean: 4.027/5.063 secs] [Times:
> user=7.62 sys=0.24, real=5.06 secs]  2016-07-28T08:32:02.134-0300:
> 492257.167: [GC[YG occupancy: 186361 K (1132416 K)]2016-07-28T08:32:02.134-0300: 492257.167: [GC
> 2016-07-28T08:32:02.134-0300: 492257.167: [ParNew:
> 186361K->25052K(1132416K), 0.0231613 secs]
> 3209034K->3049538K(6165632K), 0.0236069 secs] [Times: user=0.17
> sys=0.00, real=0.02 secs]  2016-07-28T08:32:02.158-0300: 492257.191:
> [Rescan (parallel) , 0.0289939 secs]2016-07-28T08:32:02.187-0300:
> 492257.220: [weak refs processing, 0.3412320 secs] [1 CMS-remark: 3024485K(5033216K)] 3049538K(6165632K), 0.4179878 secs] [Times:
> user=0.67 sys=0.00, real=0.42 secs]  2016-07-28T08:32:02.552-0300:
> 492257.586: [CMS-concurrent-sweep-start] 2016-07-28T08:32:07.144-0300: 492262.177: [CMS-concurrent-sweep: 4.591/4.591 secs] [Times: user=14.05 sys=0.22, real=4.59 secs]  2016-07-28T08:32:07.146-0300:
> 492262.180: [CMS-concurrent-reset-start] 2016-07-28T08:32:07.283-0300: 492262.316: [CMS-concurrent-reset: 0.137/0.137 secs] [Times: user=0.36 sys=0.01, real=0.14 secs]  2016-07-28T08:32:36.201-0300: 492291.234:
> [GC 2016-07-28T08:32:36.202-0300: 492291.234: [ParNew:
> 1030955K->45449K(1132416K), 0.0671313 secs]
> 2269248K->1285059K(6165632K), 0.0676485 secs] [Times: user=0.41
> sys=0.00, real=0.07 secs]

Is in the OSB WebLogic access log that we noticed this strange behavior, usually 5 or 10 seconds after CMS starts, there's a 20+ second gap between requests in the access log, when usually we see dozens of calls per second.

OSB1 access log:

2016-07-28 08:32:03 POST /service1/ 200 0.043  
2016-07-28 08:32:03 POST /service1/ 200 0.074   
2016-07-28 08:32:03 POST /service1/ 200 0.045   
2016-07-28 08:32:03 POST /service1/ 200 0.053   
2016-07-28 08:32:35 POST /service1/ 200 0.076

And at the same time some of the calls made from the SOA server fail:

SOA log error: https://esburl.mycompany.com/service1" successfully due to: Unable to invoke endpoint URI "https://esburl.mycompany.com/service1" successfully due to: javax.xml.soap.SOAPException: javax.xml.soap.SOAPException: Message send failed: Read timed out>

There's no errors reported in the OSB weblogic output log. We've taken some threadumps while GC was running but haven't been able to pinpoint the issue. We thought that if initial mark and remark were fast enough we wouldn't see this kind of problems, but we still have doubts regarding if this is a JVM, a WebLogic or an application issue (OSB). Anyone has seen something like this with WebLogic or another applications server?

Thanks!

EDIT:

It seems like the problem is SSL related:

~$ curl -v -k https://mycompany.com:9001/service1/ -o /dev/null 
* Hostname was NOT found in DNS cache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying x.x.x.x...
* Connected to mycompany.com (x.x.x.x) port 9001 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, **Client hello** (1):
} [data not shown]
  0     0    0     0    0     0      0      0 --:--:--  **0:00:16** --:--:--     0* 
  SSLv3, TLS handshake, **Server hello** (2):
{ [data not shown]
* SSLv3, TLS handshake, CERT (11):
{ [data not shown]
* SSLv3, TLS handshake, Server key exchange (12):
{ [data not shown]
* SSLv3, TLS handshake, Server finished (14):
{ [data not shown]
* SSLv3, TLS handshake, Client key exchange (16):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Finished (20):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
{ [data not shown]
* SSLv3, TLS handshake, Finished (20):
{ [data not shown]
* SSL connection using EDH-RSA-DES-CBC3-SHA
* Server certificate:
*    subject: C=xx; ST=xx; L=xx; O=xx; OU=Tech; CN=mycompany.com
*    start date: 2016-02-25 13:17:10 GMT
*    expire date: 2021-02-23 13:17:10 GMT
*    issuer: DC=xx; DC=xx; DC=xx; CN=xx
*    SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
> GET /service1 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: mycompany.com:9001
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Wed, 03 Aug 2016 15:46:41 GMT
< Transfer-Encoding: chunked
< Content-Type: text/xml; charset=UTF-8
< X-Powered-By: Servlet/2.5 JSP/2.1

It seems that while doing GC it takes the server a lot of time to complete the SSL handshake (in this example 16 seconds) But I don't have any idea why. Does anyone know of any java problem related to GCs and SSL protocol?

Regards.

Are you collecting resource usage stats? Disk, CPU, RAM, etc? I suspect that GC may be causing lots of Disk IO contention. — EEAA, Aug 02 '16 at 17:16
It seems that there's an uptick on CPU usage while GC runs, but doesn't go above 30% (at leat according to top). Not so sure about IO though, i'll consult with the OS administrator, this is a solaris on x86 box, not sure exactly what to look for. — Franco, Aug 02 '16 at 18:06

score 2 · Answer 1 · edited May 23 '17 at 11:33

After a long investigation we finally found the cause of the problem, while the CMS GC was running, we took several thread dumps and found this conspicuous stack in 5 or 6 threads:

"[ACTIVE] ExecuteThread: '35' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=3 tid=0x000000001333c000 nid=0xa1 runnable [0xfffffd7e23dd1000] 
   java.lang.Thread.State: RUNNABLE 
at sun.security.pkcs11.wrapper.PKCS11.C_DestroyObject(Native Method) 
at sun.security.pkcs11.SessionKeyRef.dispose(P11Key.java:1043) 
at sun.security.pkcs11.SessionKeyRef.drainRefQueueBounded(P11Key.java:1019) 
at sun.security.pkcs11.SessionKeyRef. (P11Key.java:1034) 
at sun.security.pkcs11.P11Key. (P11Key.java:98) 
at sun.security.pkcs11.P11Key$P11SecretKey. (P11Key.java:379) 
at sun.security.pkcs11.P11Key.secretKey(P11Key.java:271) 
at sun.security.pkcs11.P11TlsRsaPremasterSecretGenerator.engineGenerateKey(P11TlsRsaPremasterSecretGenerator.java:84) 
at javax.crypto.KeyGenerator.generateKey(DashoA13*..) 
at com.sun.net.ssl.internal.ssl.RSAClientKeyExchange. (RSAClientKeyExchange.java:68) 
at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverHelloDone(ClientHandshaker.java:807) 
at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:286) 
at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:608) 
at com.sun.net.ssl.internal.ssl.Handshaker$1.run(Handshaker.java:548) 
at java.security.AccessController.doPrivileged(Native Method) 
at com.sun.net.ssl.internal.ssl.Handshaker$DelegatedTask.run(Handshaker.java:941) 
- locked <0xfffffd7e402b5ca0> (a com.sun.net.ssl.internal.ssl.SSLEngineImpl)

The execution of this code was present in the same thread in dumps taken 20-25 seconds apart, so it seemed a very good suspect.

After reviewing Oracle Supports documentation and these links:

https://bugs.openjdk.java.net/browse/JDK-8059337 https://stackoverflow.com/questions/31188663/disabling-pkcs11-solaris-implementation

We decided to disable Solaris pkcs11 implementation with "-Dsun.security.pkcs11.enable-solaris=false". Once we did that, all timeouts disappeared, the gaps in the access logs were no longer present after a GC.

I'm not sure if there's any advantage of using sun pcks11 over other providers, but disabling it seems to solve our problem and haven't found a new issue produced by the change so far, so we're going to stick with it.

WebLogic server seems to freeze after CMS Garbage Collection run

1 Answers1