3

I've noticed a very peculiar problem with Apache. I have a very high number of virtual hosts set up - it's around 501.

Problems started occuring after vhost number 493. First 493 vhosts work as expected, but as soon as I add vhost number 494, PHP stops communicating with memcache, and it times out on every read/write access.

Actually, I'm using memcache as a backend sessions store, so, php function:

session_start();

simply times out after 30 sec.

If I delete the random one of the 494 vhosts and restart apache, it starts working again.

I've set ulimit really high (65k) but it doesn't help. I've tried switching off ulimits completely, but no luck.

Do you guys have any ideas what more could I try?

I've tried stracing the httpd process to which I'm connected to, after I hit the enter in browser and after 30 sec wait commences.

This is the strace output:

select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})
select(1170, [1024 1169], [], NULL, {1, 0}) = 2 (in [1024 1169], left {0, 999998})

So basically apache is stuck on select(), and that's it, it repeats the select() system call indefinitely.

Next thing I have come up with is tcpdump, to see if the package actually goes through from apache, and it really does:

22:11:28.366677 IP6 ::1.51404 > ::1.11914: Flags [S], seq 2899674987, win 32752, options [mss 16376,sackOK,TS val 1384759049 ecr 0,nop,wscale 9], length 0
22:11:28.366697 IP6 ::1.11914 > ::1.51404: Flags [S.], seq 2034630080, ack 2899674988, win 32728, options [mss 16376,sackOK,TS val 1384759049 ecr 1384759049,nop,wscale 9], length 0
22:11:28.366709 IP6 ::1.51404 > ::1.11914: Flags [.], ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
22:11:28.366752 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 1:41, ack 1, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 40
22:11:28.366758 IP6 ::1.11914 > ::1.51404: Flags [.], ack 41, win 64, options [nop,nop,TS val 1384759049 ecr 1384759049], length 0
22:11:28.366768 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 41:90, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759049], length 49
22:11:28.366772 IP6 ::1.11914 > ::1.51404: Flags [.], ack 90, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.366779 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 90:122, ack 1, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 32
22:11:28.366783 IP6 ::1.11914 > ::1.51404: Flags [.], ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367063 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 1:12, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 11
22:11:28.367070 IP6 ::1.51404 > ::1.11914: Flags [.], ack 12, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367266 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 12:20, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 8
22:11:28.367275 IP6 ::1.51404 > ::1.11914: Flags [.], ack 20, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367477 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 20:25, ack 122, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 5
22:11:28.367489 IP6 ::1.51404 > ::1.11914: Flags [.], ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 0
22:11:28.367629 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 122:181, ack 25, win 64, options [nop,nop,TS val 1384759050 ecr 1384759050], length 59
22:11:28.367859 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 25:33, ack 181, win 64, options [nop,nop,TS val 1384759051 ecr 1384759050], length 8
22:11:28.367869 IP6 ::1.51404 > ::1.11914: Flags [P.], seq 181:230, ack 33, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 49
22:11:28.368102 IP6 ::1.11914 > ::1.51404: Flags [P.], seq 33:41, ack 230, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 8
22:11:28.368138 IP6 ::1.51404 > ::1.11914: Flags [F.], seq 230, ack 41, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
22:11:28.368195 IP6 ::1.11914 > ::1.51404: Flags [F.], seq 41, ack 231, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0
22:11:28.368206 IP6 ::1.51404 > ::1.11914: Flags [.], ack 42, win 64, options [nop,nop,TS val 1384759051 ecr 1384759051], length 0

Next thing I did is GDB of Apache process when I issue a curl call to a page containing session_start(), and this is the output:

232     *(*new)->local_addr = *sock->local_addr;
241     if (sock->local_addr->sa.sin.sin_family == AF_INET) {
238     (*new)->local_addr->pool = connection_context;
241     if (sock->local_addr->sa.sin.sin_family == AF_INET) {
238     (*new)->local_addr->pool = connection_context;
241     if (sock->local_addr->sa.sin.sin_family == AF_INET) {
245     else if (sock->local_addr->sa.sin.sin_family == AF_INET6) {
246         (*new)->local_addr->ipaddr_ptr = &(*new)->local_addr->sa.sin6.sin6_addr;
249     (*new)->remote_addr->port = ntohs((*new)->remote_addr->sa.sin.sin_port);
250     if (sock->local_port_unknown) {
256     if (apr_is_option_set(sock, APR_TCP_NODELAY) == 1) {
257         apr_set_option(*new, APR_TCP_NODELAY, 1);
266     if (sock->local_interface_unknown ||
267         !memcmp(sock->local_addr->ipaddr_ptr,
266     if (sock->local_interface_unknown ||
276         (*new)->local_interface_unknown = 1;
293     apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
292     (*new)->inherit = 0;
293     apr_pool_cleanup_register((*new)->pool, (void *)(*new), socket_cleanup,
296 }   
unixd_accept (accepted=0x7fff14ecddf0, lr=0x7fe93a905aa8, ptrans=<value optimized out>) at /usr/src/debug/httpd-2.2.15/os/unix/unixd.c:507
507     if (status == APR_SUCCESS) {
508         *accepted = csd;
649 }   
child_main (child_num_arg=<value optimized out>) at /usr/src/debug/httpd-2.2.15/server/mpm/prefork/prefork.c:650
650         SAFE_ACCEPT(accept_mutex_off());      /* unlock after "accept" */
652         if (status == APR_EGENERAL) {
656         else if (status != APR_SUCCESS) {
665         current_conn = ap_run_create_connection(ptrans, ap_server_conf, csd, my_child_num, sbh, bucket_alloc);
666         if (current_conn) {
667             ap_process_connection(current_conn, csd);

At this location there's a big pause (~30 sec), until php times out. After that, I got this:

668             ap_lingering_close(current_conn);
676         if (ap_mpm_pod_check(pod) == APR_SUCCESS) { /* selected as idle? */
680                  ap_scoreboard_image->global->running_generation) { /* restart? */
679         else if (ap_my_generation !=
680                  ap_scoreboard_image->global->running_generation) { /* restart? */
679         else if (ap_my_generation !=
551     while (!die_now && !shutdown_pending) {
559         apr_pool_clear(ptrans);
562              && requests_this_child++ >= ap_max_requests_per_child)) { 
561         if ((ap_max_requests_per_child > 0
562              && requests_this_child++ >= ap_max_requests_per_child)) {
561         if ((ap_max_requests_per_child > 0
562              && requests_this_child++ >= ap_max_requests_per_child)) {
561         if ((ap_max_requests_per_child > 0
566         (void) ap_update_child_status(sbh, SERVER_READY, (request_rec *) NULL);
573         SAFE_ACCEPT(accept_mutex_on());
575         if (num_listensocks == 1) {

The weirdest thing is I cannot reproduce this on another machine. Same OS, same packages, same config (puppet) same kernel, different HW.

Jakov Sosic
  • 5,267
  • 4
  • 24
  • 35
  • What are your Apache settings for the number of processes and threads it starts, and PHP-FCGI settings for how many processes it keeps per VirtualHost? – rightstuff Oct 03 '14 at 16:37
  • I use a mod_prefork coupled with mod_php. – Jakov Sosic Oct 03 '14 at 16:48
  • Seems like your tracing the parent process instead of the process handling your request. Also, check the limits at runtime using /proc so you're not mislead about effective ulimits. – covener Oct 06 '14 at 13:18
  • I straced a process by looking at lsof output after establishing connection, not a parrent. I have also checked /proc//limits to be certain about efective limits. – Jakov Sosic Oct 06 '14 at 13:47

1 Answers1

2

After couple of more weeks of debugging and watching out for problems, I finally stumbled upon a message:

You MUST recompile PHP with a larger value of FD_SETSIZE.
It is set to 1024, but you have descriptors numbered at least as high as 1073.
--enable-fd-setsize=2048 is recommended, but you may want to set it to equal
the maximum number of open files supported by your system, in order to avoid
seeing this error again at a later date. 

I will try this fix, but boy oh boy, why do PHP guys do this? This is so ugly, hardcoding nofile limit is utterly broken design. Not to mention that if this is the solution, forcing me to recompile each PHP minor version and security patch and maintain my own packages is a big big hustle.

EDIT: after more extensive debugging, it seems it's not only PHP that's 'broken by design', there's also a bunch of issues with memcache extension itself.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629896

https://bugs.php.net/bug.php?id=59876

Bugs are open for quite a while now, and nothing happens. I guess one should just dump memcache extension and find a solution independent from it :-/

Jakov Sosic
  • 5,267
  • 4
  • 24
  • 35
  • I suppose there are circumstances where you don't care, but sharing a session store between virtual hosts sounds nasty from a security perspective. – mc0e Dec 03 '14 at 00:17
  • This is actually a dev machine with lot of vhosts, but on the other hand, even if you have one and exaust 1024 FDs, memcache extension breaks. – Jakov Sosic Dec 03 '14 at 00:20
  • Apart from lots of vhosts, what would be using so many FDs? (and can you arrange that differently?) – mc0e Dec 03 '14 at 00:26