We're using Java Hotspot 1.6 with a 10.3.6 WebLogic Cluster with 2 nodes running OSB 11.1.1.7 behind a physical load balancer:
java version "1.6.0_91"
Java(TM) SE Runtime Environment (build 1.6.0_91-b13)
Java HotSpot(TM) 64-Bit Server VM (build 20.91-b07, mixed mode)
We've tuned the JVM to use CMS and avoid long running garbage collections:
-XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:NewRatio=4 -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=60 -XX:+ScavengeBeforeFullGC -XX:+CMSScavengeBeforeRemark
Most of the pause phases of the GCs (initial mark and remark) take less than a second, we don't see any failed CMS runs and we don't see Full GCs in the gc logs.
However, we're receiving timeout errors on clients trying to connect to OSB a few seconds after a completed CMS GC execution (particularly in an Oracle SOA installation with lots of composites that call OSB):
OSB 1 gc.log:
> 2016-07-28T08:31:52.656-0300: 492247.689: [GC
> 2016-07-28T08:31:52.656-0300: 492247.690: [ParNew:
> 1027265K->22014K(1132416K), 0.0412613 secs]
> 4046262K->3043322K(6165632K), 0.0420791 secs] [Times: user=0.25
> sys=0.00, real=0.04 secs] 2016-07-28T08:31:52.701-0300: 492247.735:
> [GC [1 CMS-initial-mark: 3021307K(5033216K)] 3043844K(6165632K),
> 0.0683781 secs] [Times: user=0.08 sys=0.00, real=0.07 secs] 2016-07-28T08:31:52.770-0300: 492247.804: [CMS-concurrent-mark-start]
> 2016-07-28T08:31:56.971-0300: 492252.005: [CMS-concurrent-mark:
> 4.201/4.201 secs] [Times: user=10.55 sys=0.16, real=4.20 secs] 2016-07-28T08:31:56.971-0300: 492252.005:
> [CMS-concurrent-preclean-start] 2016-07-28T08:31:57.064-0300:
> 492252.098: [CMS-concurrent-preclean: 0.090/0.093 secs] [Times: user=0.12 sys=0.00, real=0.09 secs] 2016-07-28T08:31:57.064-0300:
> 492252.098: [CMS-concurrent-abortable-preclean-start] 2016-07-28T08:32:01.198-0300: 492256.231: [GC
> 2016-07-28T08:32:01.198-0300: 492256.232: [ParNew:
> 1028128K->30711K(1132416K), 0.0502326 secs]
> 4049435K->3053383K(6165632K), 0.0508313 secs] [Times: user=0.34
> sys=0.00, real=0.05 secs] CMS: abort preclean due to time
> 2016-07-28T08:32:02.128-0300: 492257.161:
> [CMS-concurrent-abortable-preclean: 4.027/5.063 secs] [Times:
> user=7.62 sys=0.24, real=5.06 secs] 2016-07-28T08:32:02.134-0300:
> 492257.167: [GC[YG occupancy: 186361 K (1132416 K)]2016-07-28T08:32:02.134-0300: 492257.167: [GC
> 2016-07-28T08:32:02.134-0300: 492257.167: [ParNew:
> 186361K->25052K(1132416K), 0.0231613 secs]
> 3209034K->3049538K(6165632K), 0.0236069 secs] [Times: user=0.17
> sys=0.00, real=0.02 secs] 2016-07-28T08:32:02.158-0300: 492257.191:
> [Rescan (parallel) , 0.0289939 secs]2016-07-28T08:32:02.187-0300:
> 492257.220: [weak refs processing, 0.3412320 secs] [1 CMS-remark: 3024485K(5033216K)] 3049538K(6165632K), 0.4179878 secs] [Times:
> user=0.67 sys=0.00, real=0.42 secs] 2016-07-28T08:32:02.552-0300:
> 492257.586: [CMS-concurrent-sweep-start] 2016-07-28T08:32:07.144-0300: 492262.177: [CMS-concurrent-sweep: 4.591/4.591 secs] [Times: user=14.05 sys=0.22, real=4.59 secs] 2016-07-28T08:32:07.146-0300:
> 492262.180: [CMS-concurrent-reset-start] 2016-07-28T08:32:07.283-0300: 492262.316: [CMS-concurrent-reset: 0.137/0.137 secs] [Times: user=0.36 sys=0.01, real=0.14 secs] 2016-07-28T08:32:36.201-0300: 492291.234:
> [GC 2016-07-28T08:32:36.202-0300: 492291.234: [ParNew:
> 1030955K->45449K(1132416K), 0.0671313 secs]
> 2269248K->1285059K(6165632K), 0.0676485 secs] [Times: user=0.41
> sys=0.00, real=0.07 secs]
Is in the OSB WebLogic access log that we noticed this strange behavior, usually 5 or 10 seconds after CMS starts, there's a 20+ second gap between requests in the access log, when usually we see dozens of calls per second.
OSB1 access log:
2016-07-28 08:32:03 POST /service1/ 200 0.043
2016-07-28 08:32:03 POST /service1/ 200 0.074
2016-07-28 08:32:03 POST /service1/ 200 0.045
2016-07-28 08:32:03 POST /service1/ 200 0.053
2016-07-28 08:32:35 POST /service1/ 200 0.076
And at the same time some of the calls made from the SOA server fail:
SOA log error: https://esburl.mycompany.com/service1" successfully due to: Unable to invoke endpoint URI "https://esburl.mycompany.com/service1" successfully due to: javax.xml.soap.SOAPException: javax.xml.soap.SOAPException: Message send failed: Read timed out>
There's no errors reported in the OSB weblogic output log. We've taken some threadumps while GC was running but haven't been able to pinpoint the issue. We thought that if initial mark and remark were fast enough we wouldn't see this kind of problems, but we still have doubts regarding if this is a JVM, a WebLogic or an application issue (OSB). Anyone has seen something like this with WebLogic or another applications server?
Thanks!
EDIT:
It seems like the problem is SSL related:
~$ curl -v -k https://mycompany.com:9001/service1/ -o /dev/null
* Hostname was NOT found in DNS cache
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying x.x.x.x...
* Connected to mycompany.com (x.x.x.x) port 9001 (#0)
* successfully set certificate verify locations:
* CAfile: none
CApath: /etc/ssl/certs
* SSLv3, TLS handshake, **Client hello** (1):
} [data not shown]
0 0 0 0 0 0 0 0 --:--:-- **0:00:16** --:--:-- 0*
SSLv3, TLS handshake, **Server hello** (2):
{ [data not shown]
* SSLv3, TLS handshake, CERT (11):
{ [data not shown]
* SSLv3, TLS handshake, Server key exchange (12):
{ [data not shown]
* SSLv3, TLS handshake, Server finished (14):
{ [data not shown]
* SSLv3, TLS handshake, Client key exchange (16):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Finished (20):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
{ [data not shown]
* SSLv3, TLS handshake, Finished (20):
{ [data not shown]
* SSL connection using EDH-RSA-DES-CBC3-SHA
* Server certificate:
* subject: C=xx; ST=xx; L=xx; O=xx; OU=Tech; CN=mycompany.com
* start date: 2016-02-25 13:17:10 GMT
* expire date: 2021-02-23 13:17:10 GMT
* issuer: DC=xx; DC=xx; DC=xx; CN=xx
* SSL certificate verify result: self signed certificate in certificate chain (19), continuing anyway.
> GET /service1 HTTP/1.1
> User-Agent: curl/7.35.0
> Host: mycompany.com:9001
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 03 Aug 2016 15:46:41 GMT
< Transfer-Encoding: chunked
< Content-Type: text/xml; charset=UTF-8
< X-Powered-By: Servlet/2.5 JSP/2.1
It seems that while doing GC it takes the server a lot of time to complete the SSL handshake (in this example 16 seconds) But I don't have any idea why. Does anyone know of any java problem related to GCs and SSL protocol?
Regards.