We're currently running performance tests using Solaris 11 (SPARC) on some large hardware. The tests, which consist of sending SOAP requests (50kb per request), run well until we get into the tens of thousands of users (e.g. 30,000), at which point, roughly 2 minutes into the run, we start seeing a number of connection time-out errors in the logs. CPU and memory usage remain low, never exceeding 15% at any time. We are using WebLogic 11g and Oracle HTTP Server.
I have adjusted the following TCP parameters; however, they don't seem to have made any significant difference (the commands I used are sketched after the list):
_conn_req_max_q = 262144 (also tried 16384)
_conn_req_max_q0 = 16384 (also tried 4096 - increased to stop tcpListenDropQ0 rising above 0)
_time_wait_interval = 15000
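For reference, this is roughly how I applied them. On Solaris 11 the old ndd tunables are exposed as private ipadm TCP properties; the exact property names below are my assumption, taken from the underscore-prefixed names above:

# set the private TCP properties (Solaris 11 replacement for ndd -set /dev/tcp)
ipadm set-prop -p _conn_req_max_q=262144 tcp
ipadm set-prop -p _conn_req_max_q0=16384 tcp
ipadm set-prop -p _time_wait_interval=15000 tcp
# confirm the values took effect
ipadm show-prop -p _conn_req_max_q,_conn_req_max_q0,_time_wait_interval tcp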
I also added the following to /etc/system:
set ip:ipcl_conn_hash_size=16834
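After the reboot I sanity-checked that the setting had actually been applied (assuming mdb -k exposes the ip module variable under the same name):

# read the live kernel value of the tunable set in /etc/system
echo "ipcl_conn_hash_size/D" | mdb -k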
Running netstat -sP tcp at the end of the test (the server was rebooted before the test started) produces the following:
TCP tcpRtoAlgorithm = 4 tcpRtoMin = 200
tcpRtoMax = 60000 tcpMaxConn = -1
tcpActiveOpens =133886 tcpPassiveOpens =584461
tcpAttemptFails =102899 tcpEstabResets =553474
tcpCurrEstab = 339 tcpOutSegs =35235864
tcpOutDataSegs =20302930 tcpOutDataBytes =842489656
tcpRetransSegs = 92070 tcpRetransBytes =337976
tcpOutAck =2044606 tcpOutAckDelayed =252534
tcpOutUrg = 0 tcpOutWinUpdate = 0
tcpOutWinProbe = 0 tcpOutControl =901262
tcpOutRsts = 29486 tcpOutFastRetrans = 0
tcpInSegs =39352489
tcpInAckSegs = 0 tcpInAckBytes =2742139410
tcpInDupAck = 32470 tcpInAckUnsent = 0
tcpInInorderSegs =15010534 tcpInInorderBytes =1321218448
tcpInUnorderSegs = 1515 tcpInUnorderBytes =2008280
tcpInDupSegs = 47362 tcpInDupBytes =160101
tcpInPartDupSegs = 0 tcpInPartDupBytes = 0
tcpInPastWinSegs = 0 tcpInPastWinBytes = 0
tcpInWinProbe = 0 tcpInWinUpdate = 0
tcpInClosed = 1099 tcpRttNoUpdate = 425
tcpRttUpdate =11258426 tcpTimRetrans =194800
tcpTimRetransDrop = 4 tcpTimKeepalive = 0
tcpTimKeepaliveProbe= 0 tcpTimKeepaliveDrop = 0
tcpListenDrop =300269 tcpListenDropQ0 = 0
tcpHalfOpenDrop = 0 tcpOutSackRetrans = 7
The tcpListenDrop value is still quite high, but it starts increasing before we see the errors in the logs, so it may be unrelated; I am not sure. Are there any other (TCP) parameters worth tuning to try to reduce the number of errors we are seeing? If not, is there a recommended way to diagnose this kind of issue?
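For what it's worth, this is the kind of live observation I was planning to try next to correlate the drops with the application errors. The DTrace mib probe name is my assumption, based on the counter name reported by netstat:

# re-display the TCP counters every 5 seconds during the test
netstat -s -P tcp 5 | egrep 'tcpListenDrop|tcpAttemptFails'
# count listen-queue drops per second with the DTrace mib provider
# (probe name assumed to match the netstat counter name)
dtrace -n 'mib:::tcpListenDrop { @drops = count(); } tick-1s { printa(@drops); trunc(@drops); }'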