2

We run apache2 server with multiple sites (each has own domain or subdomain) using SSL (also multiple certificates on same IP but mostly an asterisk certificate for subdomains). Now we are having a problem that when there are more than 20-30 individual sites per server, restart takes over 20 seconds.

Logs don't show anything and I am not sure how to figure out what takes so long to restart. This may be connected - running apache2ctl -S also takes roughly same amount of time (When I run one restart after another or one -S after another its fast <1s, if I wait a minute or so its again slow).

How can I trouble shoot this issue? How to determine what causes those slow restarts (we need to restart when we add new sites and its slowly becoming unmanageable).

-- Update --

So, after all this time, this is still a problem.

There has been some changes that may point me in a right direction:

One of the servers all of the sudden started to work as fast as I would expect it too. Not sure why it happened as I cant pinpoint when exactly this happened. I compared apache configs and they are pretty much identical on all servers but one now restart fast and other two are still slow (>2min).

Now, just recently when troubleshooting something else I came across some comments about ipv6 setup slowing down certain certificate issuance etc. I have tested ipv6 connectivity and only discrepancy I found is that on a fast server when I try this:

wget -6 https://google.com/

I get this:

--2017-06-09 07:49:32-- https://google.com/ Resolving google.com (google.com)... 2a00:1450:4009:809::200e Connecting to google.com (google.com)|2a00:1450:4009:809::200e|:443... connected. HTTP request sent, awaiting response... 302 Found Location: https://ipv6.google.com/sorry/index?continue=https://google.com/&q=EhAqAX4AAAAAAPA8kf_-iVh9GIym6ckFIhkA8aeDS9iurm4ysYfhKSPR5hfHhPY5mBEsMgNyY24 [following] --2017-06-09 07:49:33-- https://ipv6.google.com/sorry/index?continue=https://google.com/&q=EhAqAX4AAAAAAPA8kf_-iVh9GIym6ckFIhkA8aeDS9iurm4ysYfhKSPR5hfHhPY5mBEsMgNyY24 Resolving ipv6.google.com (ipv6.google.com)... 2a00:1450:401b:802::200e Connecting to ipv6.google.com (ipv6.google.com)|2a00:1450:401b:802::200e|:443... connected. HTTP request sent, awaiting response... 503 Service Unavailable 2017-06-09 07:49:33 ERROR 503: Service Unavailable.

While on the slow servers same request produces this:

--2017-06-09 07:50:37--  https://google.com/
Resolving google.com (google.com)... 2404:6800:4003:802::200e
Connecting to google.com (google.com)|2404:6800:4003:802::200e|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.google.com/ [following]
--2017-06-09 07:50:37--  https://www.google.com/
Resolving www.google.com (www.google.com)... 2404:6800:4001:80b::2004
Connecting to www.google.com (www.google.com)|2404:6800:4001:80b::2004|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'index.html'
[ <=>    ] 10,382      --.-K/s   in 0.001s

2017-06-09 07:50:37 (18.4 MB/s) - 'index.html' saved [10382]

So in other words there appears to be some issue with ipv6 that possibly causes timeout when trying to reboot apache?

I also figured out the following - its a shutdown that takes all the time. If I shutdown apache and then try to start it starting is fast.

To answer previous questions:

Ciphers are optimised (all sites getting A on ssl test). Traffic is not great. Here is the server-status dump before a 2 min restart:

Apache Server Status for s3.example.com (via 0.0.0.0)

Server Version: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.21 OpenSSL/1.0.1f
Server MPM: prefork
Server Built: May 9 2017 16:14:10

Current Time: Friday, 09-Jun-2017 08:05:46 UTC
Restart Time: Friday, 09-Jun-2017 07:57:35 UTC
Parent Server Config. Generation: 1
Parent Server MPM Generation: 0
Server uptime: 8 minutes 10 seconds
Server load: 0.00 0.00 0.00
Total accesses: 15 - Total Traffic: 23 kB
CPU Usage: u0 s0 cu0 cs0
.0306 requests/sec - 48 B/second - 1570 B/request
1 requests currently being processed, 7 idle workers

____W___........................................................
................................................................
......................

Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process

Srv PID Acc M   CPU     SS  Req Conn    Child   Slot    Client  VHost   Request
0-0 11047   0/2/2   _   0.00    456 0   0.0 0.00    0.00    77.0.0.0    s3.example.com:80   NULL
2-0 11049   0/3/3   _   0.00    219 0   0.0 0.00    0.00    127.0.0.1   s3.example.com:80   GET /server-status?auto HTTP/1.1
4-0 11051   0/7/7   W   0.00    0   0   0.0 0.01    0.01    77.0.0.0    s3.example.com:443  GET /server-status HTTP/1.1
6-0 11166   0/2/2   _   0.00    35  0   0.0 0.00    0.00    127.0.0.1   s3.example.com:443  \x16\x03\x01
7-0 11167   0/1/1   _   0.00    333 1   0.0 0.00    0.00    77.0.0.0    s3.example.com:443  NULL
Srv Child Server number - generation
PID OS process ID
Acc Number of accesses this connection / this child / this slot
M   Mode of operation
CPU CPU usage, number of seconds
SS  Seconds since beginning of most recent request
Req Milliseconds required to process most recent request
Conn    Kilobytes transferred this connection
Child   Megabytes transferred this child
Slot    Total megabytes transferred this slot
SSL/TLS Session Cache Status:
cache type: SHMCB, shared memory: 512000 bytes, current entries: 0
subcaches: 32, indexes per subcache: 88
index usage: 0%, cache usage: 0%
total entries stored since starting: 0
total entries replaced since starting: 0
total entries expired since starting: 0
total (pre-expiry) entries scrolled out of the cache: 0
total retrieves since starting: 0 hit, 1 miss
total removes since starting: 0 hit, 0 miss
Apache/2.4.7 (Ubuntu) Server at s3.example.com Port 443

Please advise...

RandomWhiteTrash
  • 269
  • 1
  • 3
  • 17
  • Why do you restart instead of reload? – Michael Hampton Sep 20 '16 at 07:27
  • I think when adding new certificates reload was not always working as it should. Besides reload takes as long as restart. Sorry didn't mention that. – RandomWhiteTrash Sep 20 '16 at 07:31
  • 1
    If you use 20-30 SSL certificates depending on the encryption algorithms and ciphers you use it's perfectly normal to take some time. However, you can and probably should reduce the amount of ciphers you use by disabling the less secure ones. You probably should test one or all of your sites with SSL-Test. If your server supports less secure ciphers (aka you get a bad rating) you should remove those ciphers, not only to speed up apache, but also to secure the connections of your sites. https://www.ssllabs.com/ssltest/ – Broco Sep 20 '16 at 11:17
  • At least mention the user amount connected when that happens, the SSL configuration, the size of the keys in your private keys (eg: 4096 adds a large overhead), and everything the previous comments say. – Daniel Ferradal Sep 22 '16 at 18:00
  • See update in the question – RandomWhiteTrash Jun 09 '17 at 08:08

3 Answers3

2

Use strace when restarting your httpd: http://man7.org/linux/man-pages/man1/strace.1.html

It should tell you what the process is doing in the waiting period, so it might give you a better clue as to what the cause is.

Ricky
  • 81
  • 5
  • All I got is this : `41.180062 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=21897, si_status=0, si_utime=4, si_stime=7} ---` When it restarts fast this is what I got... pretty simmilar: `0.213050 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=20109, si_status=0, si_utime=4, si_stime=4} --- ` Can I somehow debug it more indepth? – RandomWhiteTrash Jun 09 '17 at 08:59
  • You should be getting a lot more output than that, what command are you executing exactly? – Ricky Jun 09 '17 at 09:05
  • strace -ro strace.log.1 apache2ctl graceful – RandomWhiteTrash Jun 09 '17 at 09:14
  • Oh I am sorry, I may have been imprecise. I do get a lot more but this those elements are pretty much the same, This is the only line that differs greatly in execution times. – RandomWhiteTrash Jun 09 '17 at 09:17
  • I can see on your first line that it took 41 seconds before it executed, what were some of the previous lines before that? – Ricky Jun 09 '17 at 09:24
  • Here is a full thing: https://pastebin.com/3nkC3Ldg – RandomWhiteTrash Jun 09 '17 at 09:27
  • I also notice you're gracefully restarting. It would not be unusual for this to take a long time on a high traffic server. What happens if you execute a non-graceful (normal) restart? – Ricky Jun 09 '17 at 09:27
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/60173/discussion-between-ricky-and-randomwhitetrash). – Ricky Jun 09 '17 at 09:29
2

The IPv4/IPv6 clue remimnds me of the glibc problem in Bug 459756 - DNS resolver library doesn't seem to be working reliably

(also RH Knowledgebase DOC-58626, I don't have access)

In short, RHEL6 (and Fedora, Centos, Arch) were performing parallel IPv6 and IPv4 DNS resolution and unpredictable results as IPv6 deployment was sometimes less reliable.

Possible workarounds are to:

  • add "options single-request-reopen" to /etc/resolv.conf. This forces the A and AAAA requests to share a sockest. resolv.conf man page here
  • run a local caching nameserver like [nscd] (or dnsmasq, not included) 3
quadruplebucky
  • 5,139
  • 20
  • 23
  • That's what came to my mind, too. It is rather simple to test this. Just time `wget http://www.google.com/` vs. `wget -4 http://www.google.com/`. If the latter is much faster you're probably hitting a firewall/dns issue – Andreas Rogge Jun 14 '17 at 12:35
0

Others have suggested strace, but haven't filled in the most useful arguments to start it with given your issue, or given a nicer framework for using your shell history to reinvoke it over and over... which is quite a common use case. ;)

This process may seem excessive, but I like to do everything to capture my data first when possible, rather than iteratively figure out that I've missed something because I try to go too fast to get an answer.

I always strace with something like this (this example is for a daemon called "httpd", adjust as needed):

daemon="httpd"; rm /tmp/strace.* ; service $daemon restart ; (strace -v -a 40 -f -ff -tt -yy -s 128000 -o /tmp/strace -p "`pidof $daemon`" & ) && sleep 1 && service $daemon restart

Then, cd to /tmp/. You may have a ton of "strace.*" files, but you are really likely only interested in the most recently touched. So, an

ls -latr /tmp/

should tell you your most interesting files to inspect. Look for obvious system errors, or large gaps in time, or timeouts. Have fun! :)

Jesse Adelman
  • 978
  • 5
  • 15