2

I have written a webapp in Django that I host under Apache 2.4 using mod_wsgi.

At first glance, it works fine. However, when the Apache process has been running for a while, the webapp starts responding with 503 errors. You can get through sometimes after F5'ing a lot.

The annoying thing is that I dont have anything related to this in my Apache error.log at all, so I don't know where to start debugging this. My only lead is the error displayed in the browser: 503 Service Unavailable.

Other websites (PHP) are not affected. Restarting the Apache server reliably resolves the issue immediately.

My Apache config for this webapp:

    WSGIDaemonProcess app python-path=/opt/app home=/opt/app
    WSGIProcessGroup app
    WSGIScriptAlias /app /opt/app/wsgi.py

I have found this but I don't think that's the problem because the sockets are already placed in /var/run/apache2/ in my case (not the "apache log directory"). This directory is readable to others. The sockets themselves have mode 700, but have the correct owner (www-data).

Any ideas?

EDIT:

I noticed that when the problem occurs and I restart Apache to fix it, Apache takes quite long to exit. In the error log I found these messages:

[Sun Jun 25 14:14:18.935566 2017] [core:warn] [pid 374:tid 140587055682752] AH00045: child process 21761 still did not exit, sending a SIGTERM
[Sun Jun 25 14:14:18.935637 2017] [core:warn] [pid 374:tid 140587055682752] AH00045: child process 21812 still did not exit, sending a SIGTERM
[Sun Jun 25 14:14:20.937578 2017] [core:warn] [pid 374:tid 140587055682752] AH00045: child process 21761 still did not exit, sending a SIGTERM
[Sun Jun 25 14:14:20.937640 2017] [core:warn] [pid 374:tid 140587055682752] AH00045: child process 21812 still did not exit, sending a SIGTERM
[Sun Jun 25 14:14:22.939827 2017] [core:warn] [pid 374:tid 140587055682752] AH00045: child process 21761 still did not exit, sending a SIGTERM
[Sun Jun 25 14:14:22.939893 2017] [core:warn] [pid 374:tid 140587055682752] AH00045: child process 21812 still did not exit, sending a SIGTERM
[Sun Jun 25 14:14:24.942034 2017] [core:error] [pid 374:tid 140587055682752] AH00046: child process 21761 still did not exit, sending a SIGKILL
[Sun Jun 25 14:14:24.942176 2017] [core:error] [pid 374:tid 140587055682752] AH00046: child process 21812 still did not exit, sending a SIGKILL

So by the looks of it, the problem might be caused by (WSGI?) threads that are hung? How do I proceed to debug this?

Compizfox
  • 384
  • 1
  • 6
  • 18
  • Generally when Apaches does not log an error is because that error comes directly from a backend server. Errors as 503 are always logged. Perhaps you want to review the docs of mod_wsgi for troubleshooting its behaviour. – Daniel Ferradal May 13 '17 at 19:19
  • @ezra-s Now I think about it, when I get the 503, there's nothing in the access log either. It does log the access when the request goes through though. – Compizfox May 13 '17 at 19:24
  • @ezra-s The only thing related to 503s I could find in the mod_wsgi docs is the thing about the sockets, as described above. If the error came directly from the Django app, Apache would log the access to `access.log` (like it does for requests that go through without problems), right? – Compizfox May 13 '17 at 19:38
  • if you have no entry in access.log then you should not be reaching your server at all, or the log is not defined in the appropiate context, all requests are logged in access.log except if conditional logging is set on purpose. – Daniel Ferradal May 13 '17 at 19:56
  • @ezra-s Weird. It must reach the server; I get a 503 back from Apache (I'm sure there is no proxy in front or it or anything). – Compizfox May 13 '17 at 22:40
  • This one helped me: https://serverfault.com/a/909023/573987 (exactly the same Problem, but on Raspian – bb1950328 May 09 '20 at 15:34

1 Answers1

0

This is 5 years late, but I may have some insight to share:

The immediate issue is that Apache does not appear to be logging the 503s. What I found was that Apache was logging the 503s, but attempting to do so to a deleted file. When an Apache server process starts up, it gets a file handle for the file it's going to write logs to. Importantly, it does not update that file handle; the expectation is that if the file is moved/deleted, the server will be restarted (as explained in the docs).

A common setup is to use Linux's logrotate command to automatically rotate the Apache access + error logs every night. As part of this, the config script does a graceful restart of Apache server, so that the new server processes write to the new log file. However, if for some reason, the graceful restart fails to kill+restart some of the server processes, those processes will still be writing to the old logfile, which is often automatically deleted after a day. Thus if you're looking into this issue after that has happened, there will be no 503s to see! They are being written to an old, perhaps nonexistent, log file.

Why would a graceful restart fail to kill some of the server processes? I'm not 100% sure, but my best guess is that it's because of this bug in Apache 2.4: https://bz.apache.org/bugzilla/show_bug.cgi?id=63169

Beyond just the logging issue, this is the real issue behind the 503s as far as I can tell. There is a process stuck on an old generation of the server, and if load gets high enough, any requests to that process will 503 because of some interaction with mod_wsgi that I don't understand or care about.

Some potential actions to take based on all of that information:

  1. If you're able, try updating to Apache 2.49. That's not an option in my environment, so I'm not sure if that helps or not, though.
  2. Consider Apache's Piped Logs, an alternative for writing to log files that doesn't require a server restart. Docs here. I don't think it would fix the 503s, but at least logging would work properly.
  3. Consider increasing the MinSpareThreads value in your mpm_event.conf (or mpm_worker.conf). If the Apache bug I linked to is actually the issue, then the problem is that there are spikes of traffic that create too many processes, and Apache forgets about them when restarting gracefully. Anecdotally, this solution alone seems to have fixed the issue for me, but it's only been a few days.

Here are some signals that you might be having the same issue I'm describing:

  1. If you run lsof | grep /var/log/apache2 (or wherever your logs are) there will be at least one process that has an open file handle to a (deleted) log file.
  2. There is a server process in Apache's server-status that is marked as (old gen) but is not marked as stopping.
  3. If you do a hard restart of Apache, the problem should be fixed until at least the next logrotate runs.
samGbos
  • 101