18

We have an Apache web server in front of Tomcat, hosted on EC2; the instance type is extra large with 34GB of memory.

Our application deals with a lot of external web services, and one of them is very lousy: it takes almost 300 seconds to respond to requests during peak hours.

During peak hours the server chokes at just about 300 httpd processes (ps -ef | grep httpd | wc -l = 300).

I have googled and found numerous suggestions, but nothing seems to work. Below are some of the configuration changes I have made, taken directly from online resources.

I have increased the max connection and max client limits in both Apache and Tomcat. Here are the configuration details:

//apache

    <IfModule prefork.c>
        StartServers            100
        MinSpareServers         10
        MaxSpareServers         10
        ServerLimit             50000
        MaxClients              50000
        MaxRequestsPerChild     2000
    </IfModule>

//tomcat

    <Connector port="8080" protocol="org.apache.coyote.http11.Http11NioProtocol"
           connectionTimeout="600000"
           redirectPort="8443"
           enableLookups="false" maxThreads="1500"
           compressableMimeType="text/html,text/xml,text/plain,text/css,application/x-javascript,text/vnd.wap.wml,text/vnd.wap.wmlscript,application/xhtml+xml,application/xml-dtd,application/xslt+xml"
           compression="on"/>

//Sysctl.conf

 net.ipv4.tcp_tw_reuse=1
 net.ipv4.tcp_tw_recycle=1
 fs.file-max = 5049800
 vm.min_free_kbytes = 204800
 vm.page-cluster = 20
 vm.swappiness = 90
 net.ipv4.tcp_rfc1337=1
 net.ipv4.tcp_max_orphans = 65536
 net.ipv4.ip_local_port_range = 5000 65000
 net.core.somaxconn = 1024

I have been trying numerous suggestions, but in vain. How do I fix this? I'm sure an m2xlarge server should serve more than 300 requests; I'm probably going wrong somewhere in my configuration.

The server chokes only during peak hours, when there are 300 concurrent requests waiting for the [300 second delayed] web service to respond.

I was monitoring the TCP connections with netstat and found around 1000 connections in the TIME_WAIT state. I have no idea what that means in terms of performance, but I'm sure it must be adding to the problem.
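
For reference, a quick way to count connections per state is something like the following generic one-liner (exact output columns may vary slightly by distribution):

    # count TCP connections grouped by state (TIME_WAIT, ESTABLISHED, ...)
    netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn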

Output of top

 8902  root      25   0 19.6g 3.0g  12m S  3.3  8.8  13:35.77 java
 24907 membase   25   0  753m 634m 2528 S  2.7  1.8 285:18.88 beam.smp
 24999 membase   15   0  266m 121m 3160 S  0.7  0.3  51:30.37 memcached
 27578 apache    15   0  230m 6300 1536 S  0.7  0.0   0:00.03 httpd
 28551 root      15   0 11124 1492  892 R  0.3  0.0   0:00.25 top


 Output of free -m

               total       used       free     shared    buffers     cached
 Mem:          35007       8470      26536          0          1         61
 -/+ buffers/cache:         8407      26599
 Swap:         15999         15      15984

 Output of iostat
 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
      26.21    0.00    0.48    0.13    0.02   73.15

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda1             14.36         4.77       329.37    9005402  622367592
sdb               0.00         0.00         0.00       1210         48

Also, at peak time there are about 10-15k TCP connections to the membase server (local).

Some errors from the mod_jk log; I hope this throws some light on the issue:

[Wed Jul 11 14:39:10.853 2012] [8365:46912560456400] [error] ajp_send_request::jk_ajp_common.c (1630): (tom2) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=110)
[Wed Jul 11 14:39:18.627 2012] [8322:46912560456400] [error] ajp_send_request::jk_ajp_common.c (1630): (tom2) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=110)
[Wed Jul 11 14:39:21.358 2012] [8351:46912560456400] [error] ajp_get_reply::jk_ajp_common.c (2118): (tom1) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Wed Jul 11 14:39:22.640 2012] [8348:46912560456400] [error] ajp_get_reply::jk_ajp_common.c (2118): (tom1) Tomcat is down or refused connection. No response has been sent to the client (yet)


Worker.properties
workers.tomcat_home=/usr/local/tomcat/
worker.list=loadbalancer
worker.tom1.port=8009
worker.tom1.host=localhost
worker.tom1.type=ajp13
worker.tom1.socket_keepalive=True
worker.tom1.connection_pool_timeout=600
worker.tom2.port=8109
worker.tom2.host=localhost
worker.tom2.type=ajp13
worker.tom2.socket_keepalive=True
worker.tom2.connection_pool_timeout=600
worker.loadbalancer.type=lb
worker.loadbalancer.balanced_workers=tom1,tom2
worker.loadbalancer.sticky_session=True
worker.tom1.lbfactor=1
worker.tom1.socket_timeout=600
worker.tom2.lbfactor=1
worker.tom2.socket_timeout=600

//Solved

Thanks all for your valuable suggestions. I had missed the maxThreads setting on the AJP 1.3 connector; now everything seems to be under control.
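
For anyone hitting the same wall, the fix amounts to adding maxThreads to the AJP connector in server.xml, roughly like this (1500 just mirrors the HTTP connector's value above; there is one such connector per Tomcat instance, on ports 8009 and 8109 here):

    <Connector port="8009" protocol="AJP/1.3"
               redirectPort="8443"
               maxThreads="1500" />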

I will also start looking at event-based servers like nginx.

john titus

9 Answers

13

Have you increased maxThreads in the AJP 1.3 Connector on port 8009?

HTTP500
7

Consider setting up an asynchronous proxying web server like nginx or lighttpd in front of Apache. Apache serves content synchronously, so workers are blocked until clients have downloaded the generated content in full (more details here). Setting up an asynchronous (non-blocking) proxy usually improves the situation dramatically (I used to lower the number of concurrently running Apache workers from 30 to 3-5 by using nginx as a frontend proxy).
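
As a rough sketch of that setup (the listen port and backend address are assumptions; Apache would be moved off port 80 to, say, 8080 behind nginx):

    # /etc/nginx/nginx.conf (fragment) -- nginx buffers the response and
    # trickles it out to slow clients, so the Apache worker is freed quickly
    http {
        upstream apache_backend {
            server 127.0.0.1:8080;   # Apache moved off port 80
        }

        server {
            listen 80;

            location / {
                proxy_pass http://apache_backend;
                proxy_set_header Host $host;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            }
        }
    }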

Alex
5

I suspect your problem is in Tomcat, not Apache, from the logs you have shown anyway. When you get 'error 110' trying to connect back into Tomcat, it indicates you've got a queue of connections waiting to be served that has grown too large: no more can fit into the listen backlog configured for Tomcat's listening socket.

From the listen manpage:
   The  backlog  parameter defines the maximum length the queue of pending 
   connections may grow to.  If a connection request arrives with
   the queue full the client may receive an error with an indication
   of ECONNREFUSED or, if the underlying protocol supports  
   retransmission, the request may be ignored so that retries succeed.
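
One quick, generic way to check whether that backlog is actually overflowing (assuming a reasonably recent Linux) is to look at the kernel's protocol counters:

    # non-zero, growing counters here mean a listen queue is overflowing
    netstat -s | grep -i -E 'overflow|listen'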

If I had to guess, I would suspect that the vast majority of HTTP requests when the server is "choking" are blocked waiting for something to come back from Tomcat. I bet that if you attempted to fetch some static content that's served directly by Apache (rather than being proxied to Tomcat), this would work even when it's normally 'choking'.

I am not familiar with Tomcat, unfortunately, but is there a way to manipulate its concurrency settings instead?

Oh, and you might also need to consider the possibility that it's the external network service that's limiting the number of connections it will accept from you down to 300, so it makes no difference how much concurrency tuning you do on your front side if practically every connection you make relies on an external web service's response.

In one of your comments you mentioned that data goes stale after 2 minutes. I'd suggest caching the response you get from this service for two minutes to reduce the number of concurrent connections you are driving to the external web service.

Matthew Ife
2

The first step in troubleshooting this is to enable Apache's mod_status and study its report; until you've done that, you're walking blindly. That's not righteous. ;-)
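
A minimal mod_status setup looks roughly like this (2.2-style syntax; the localhost-only access restriction is just an assumption you may want to widen):

    # in httpd.conf -- assumes mod_status is compiled in or loaded
    ExtendedStatus On

    <Location /server-status>
        SetHandler server-status
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
    </Location>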

The second thing to mention (I myself dislike being given answers to questions I wasn't asking, but ...) is to use a more efficient, specialised front-end server like nginx.

Also, did you actually restart Apache, or just gracefully reload it? :)

poige
1

For any sort of enterprise-y deployment, the prefork MPM is just about the worst choice you can make: it gobbles resources like nobody's business, and respawning processes takes FOREVER compared to other MPMs.

At least switch to the worker MPM (Apache 2.2 and up) or, better yet, upgrade to the current stable version 2.4.2 with its default event MPM.

Both of these will easily handle thousands of concurrent connections with very little overhead.
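
For reference, a 2.2-era worker MPM block might look something like this (the values are illustrative, not a recommendation for this specific workload):

    <IfModule worker.c>
        StartServers           4
        MaxClients           400
        MinSpareThreads       25
        MaxSpareThreads       75
        ThreadsPerChild       25
        MaxRequestsPerChild    0
    </IfModule>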

adaptr
  • Thanks, tried that too, no luck. TIME_WAIT connections keep increasing, and the server stops responding at 350 connections. – john titus Jul 10 '12 at 13:38
  • I disagree that it's the worst choice: it is a poor choice for this context and it's likely that the problems would be eased by using the threaded server, but a better solution would be to use an event-based server (nginx or lighttpd). The event-based Apache is not nearly mature enough to be considered for an enterprise deployment IMHO. – symcbean Jul 11 '12 at 11:18
1

I know it is an old story, but I have two remarks.

First, there is a hard-coded limit for the ServerLimit directive. According to http://httpd.apache.org/docs/2.2/mod/mpm_common.html#serverlimit, the maximum is 20000 (200000 for the prefork MPM):

    There is a hard limit of ServerLimit 20000 compiled into the server
    (for the prefork MPM 200000). This is intended to avoid nasty effects
    caused by typos.

Second, as nodybo apparently mentioned, setting these two to 1 is a very bad idea:

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1

It means you reuse sockets in TIME_WAIT early. Guess what? Under heavy load the server may end up talking to the wrong client.

I found a very good article explaining this, but it is in French ;-) http://vincent.bernat.im/fr/blog/2014-tcp-time-wait-state-linux.html
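
As a sketch of the safer direction, backing these out in sysctl.conf simply means returning them to the kernel defaults:

 # tcp_tw_recycle is the dangerous one, especially with clients behind NAT;
 # turning both back off restores the kernel defaults
 net.ipv4.tcp_tw_recycle = 0
 net.ipv4.tcp_tw_reuse = 0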

Nadir
0

extra large with 34GB memory.

Big iron is not the way to scale web serving; you're just moving the bottlenecks around. But even with this much memory, I suspect that 50000 connections is pushing what the system is capable of, particularly if:

During peak hours the server chokes at just about 300 httpd processes

It would be helpful if you explained what you mean by "the server chokes".

It's also very odd to have such a high limit for connections but a very low limit for hysteresis (min/max spare servers).

Although the extract of errors you've provided doesn't show the telltale 'too many open files', I'd start by looking at the number of open file descriptors and the ulimit settings.
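
A quick, generic way to check that (run as root; adjust for your distribution):

    # limits for one running httpd worker (oldest matching pid, as an example)
    grep 'open files' /proc/$(pgrep -o httpd)/limits

    # descriptors actually in use, summed across all httpd processes
    for pid in $(pgrep httpd); do ls /proc/$pid/fd; done | wc -l

    # system-wide: allocated, free, and the fs.file-max ceiling
    cat /proc/sys/fs/file-nr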

symcbean
0

Perhaps the Apache user is running out of allowed file handles? You didn't mention them at all in your post. How many file handles is Apache currently allowed to have?

Janne Pikkarainen
0

This should really be a comment, but I can't comment as I don't have enough reputation. I came across exactly the same problem as @john titus.

We set the AJP connector's maxThreads close to our Apache thread limit to solve the issue.

To monitor this, we watched for connections stuck in SYN_SENT on our AJP port with netstat:

    netstat -an | grep :8102 | grep SYN_SENT | wc -l

This count dropped to 0, whereas it had always been a large number before the maxThreads limit was set on the AJP connector.

Vineeth