I'm getting a bunch of Apache errors that I'm having trouble tracking down. They're on a RHEL system that runs a very high-volume Drupal website.
[Mon Sep 14 12:48:44 2009] [info] [client xx.xx.xxx.xx] (70007)The timeout specified has expired: core_output_filter: writing data to the network
[Mon Sep 14 12:50:19 2009] [info] [client xx.xxx.xx.xx] (104)Connection reset by peer: core_output_filter: writing data to the network
[Mon Sep 14 12:51:28 2009] [info] [client xx.xxx.xx.xx] (32)Broken pipe: core_output_filter: writing data to the network
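To see whether these entries cluster around the load spikes, they could be tallied per minute and per error code. The following is only a rough sketch: it assumes every entry has exactly the shape of the three entries above, and the names `ENTRY` and `tally` are made up for illustration.

```python
import re
from collections import Counter

# Matches error-log entries shaped like the ones quoted above:
# [timestamp] [level] [client addr] (errno)message: context
ENTRY = re.compile(
    r"\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] \[client (?P<client>[^\]]+)\] "
    r"\((?P<code>\d+)\)(?P<msg>[^:]+): "
)


def tally(text):
    """Count log entries per (minute, error code, message)."""
    counts = Counter()
    for m in ENTRY.finditer(text):
        # "Mon Sep 14 12:48:44 2009" -> "Mon Sep 14 12:48" (drop seconds and year)
        minute = m.group("ts")[:16]
        counts[(minute, int(m.group("code")), m.group("msg"))] += 1
    return counts
```

Feeding it the contents of the error log and looking at `tally(text).most_common(10)` would show which minutes and error codes dominate.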
Occasionally (every 24 to 36 hours) there is a load spike and the site becomes completely unresponsive. The load average climbs from a normal 1-1.5 to 200. Most of the running httpd processes show as 'D' (uninterruptible sleep, usually waiting on I/O), and the only way to get the server back to an interactive state is to three-finger-salute it, or to wait until you get a prompt and run killall -9 httpd.
Obviously, the site can't be taken down for me to do a bunch of strace work. I've checked the Apache configuration and, as far as I can tell, EnableMMAP and EnableSendFile are both disabled. The files are on an NFS v3 mount, but neither the NFS server, nor the MySQL server, nor anything else is reporting errors. There is nothing relevant in the system log or dmesg. The load is also too high to reconcile individual requests with the errors resulting from them.
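One thing that can be done without strace or downtime is to snapshot which httpd processes are in the 'D' state during a spike, along with the kernel function each one is sleeping in. Below is a minimal sketch, assuming a Linux 2.6 kernel where /proc/<pid>/stat and /proc/<pid>/wchan are readable. The helper names `parse_stat` and `blocked_processes` are hypothetical, not part of any existing tool.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = stat_line.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def blocked_processes(name="httpd", proc="/proc"):
    """List (pid, wchan) for processes called `name` that are in state D."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
            if comm != name or state != "D":
                continue
            with open(os.path.join(proc, entry, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            continue  # process exited between listdir() and open()
        found.append((pid, wchan))
    return found
```

Run from cron or a loop during a spike, the wchan values may hint at where the processes are stuck. For example, if most of them are sleeping in NFS or RPC functions, that would point at the mount rather than at Apache itself.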
At this point, I'm thinking network hardware error and I'd prefer to bring the site up on a second machine. Anyone have any thoughts before I do this?