Server suddenly stops responding and then resumes an hour later

Question

My FreeBSD server had been working perfectly for over 2 years without any major changes to the system. Recently I installed an SSL certificate using Apache's mod_ssl, and after 10 days of running fine the server suddenly started crashing.

When the server crashes:

HTTPS and SSH become unresponsive instantly
Ping round-trip times climb to thousands of milliseconds before ping stops responding as well

After 15-60 minutes of being unreachable:

Server suddenly resumes and starts working at full speed - as if nothing had happened
Then in 15-60 minutes it crashes again and the cycle repeats

What I checked:

When I restart the server, nothing changes - it remains unreachable
CPU / RAM / HDD usage - OK (< 50%, including peak hours)
Traffic has no effect - it happens at any time of the day, including 4am
Disabling the firewall didn't help

In httpd-error.log I found:

[notice] Digest: generating secret for digest authentication ...
[notice] Digest: done
[notice] Apache/2.2.23 (FreeBSD) mod_ssl/2.2.23 OpenSSL/0.9.8q DAV/2 configured -- resuming normal operations
[error] server reached MaxClients setting, consider raising the MaxClients setting

I tried enabling KeepAlive and increasing MaxClients substantially (4x), but this did not solve the problem:

Timeout 120
KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 1000

<IfModule mpm_prefork_module>
    StartServers          50
    MinSpareServers       128
    MaxSpareServers      1024
    ServerLimit      1024
    MaxClients          1024
    MaxRequestsPerChild   1000
</IfModule>

In /var/log/messages just before the first crash I found:

kernel: mfi0: 228755 (454057919s/0x0008/FATAL) - Battery needs replacement - SOH Bad
kernel: mfi0: 228756 (454057984s/0x0008/FATAL) - Battery needs replacement - SOH Bad
kernel: mfi0: 228757 (454058049s/0x0008/FATAL) - Battery needs replacement - SOH Bad
kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
kernel: mfi0: 228758 (454058114s/0x0008/FATAL) - Battery needs replacement - SOH Bad
kernel: mfi0: 228759 (454058179s/0x0008/FATAL) - Battery needs replacement - SOH Bad

The "Battery needs replacement" warning disappeared after the first restart, but the arp messages keep appearing in the logs at about the same interval as the server crashes:

May 23 05:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0
May 23 05:00:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:25:90:02:08:fc on ix0
May 23 05:20:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 05:20:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 05:32:44 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0
May 23 05:40:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:25:90:02:08:fc on ix0
May 23 05:40:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 05:40:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 05:52:40 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 06:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:25:90:02:08:fc on ix0
May 23 06:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 06:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 06:00:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:25:90:02:08:fc on ix0
May 23 06:20:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:03 on ix0
May 23 06:20:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0
May 23 06:30:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:25:90:02:08:fc on ix0
May 23 06:32:36 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0
May 23 06:50:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 06:50:01 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 07:00:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:25:90:02:08:fc on ix0
May 23 07:12:28 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0
May 23 07:20:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:25:90:02:08:fc to 00:07:b4:00:00:01 on ix0
May 23 07:20:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:07:b4:00:00:03 on ix0

What should I do next to find and solve the problem?

What equipment is using the IP addresses that keep moving around? — Jenny D, May 23 '14 at 08:16

score 4 · Accepted Answer · answered May 23 '14 at 08:24

The last thing you should do now is increase MaxClients.

It's rather hard to tell. The slowdown and MaxClients warnings suggest that you're getting too much demand for the server to cope with. Unless you run a lot of AJAX/COMET stuff on the server, you really should reduce the keepalive timeout (to, say, 2 seconds initially).
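
To illustrate why raising MaxClients can make things worse rather than better, here is a rough Python sketch of the memory arithmetic for the prefork MPM. The figures (30 MB per child, 8 GB set aside for Apache) are assumptions for the example only, not measurements from your box; substitute the resident size you actually observe for httpd processes.

```python
def memory_needed_mb(max_clients: int, per_child_mb: int) -> int:
    """Worst-case memory if every prefork child is in use at once."""
    return max_clients * per_child_mb


def max_clients_for_memory(ram_for_apache_mb: int, per_child_mb: int) -> int:
    """Largest number of children that fits in the given memory budget."""
    if per_child_mb <= 0:
        raise ValueError("per_child_mb must be positive")
    return ram_for_apache_mb // per_child_mb


if __name__ == "__main__":
    per_child_mb = 30          # assumed average size of one httpd child
    ram_for_apache_mb = 8192   # assumed memory budget for Apache

    # MaxClients 1024, as in the question's configuration
    print(memory_needed_mb(1024, per_child_mb), "MB needed at MaxClients 1024")
    print(max_clients_for_memory(ram_for_apache_mb, per_child_mb),
          "children fit in the assumed budget")
```

If the first number is larger than the RAM you have, a burst of connections can push the machine into swap, which would look a lot like a server that has stopped responding.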

The "Battery needs replacement" message is not just a reminder to do some maintenance: on a BBWC (battery-backed write cache) it means that the controller is no longer attempting to cache writes, and if your system is set up properly then your OS and disks won't be caching writes either.

Both indicate that the performance of your system should be appallingly bad, yet the first thing you report is that it appears to be unavailable. Indeed, you make no mention of performance. Knowing how to measure performance and capturing the data should be high on your agenda.

I'm not sure why the address keeps moving (I assume these are local interfaces) - it may be a consequence of the load elsewhere.
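
To get a clearer picture of the address changes, it may help to tally them per IP. The following is a minimal Python sketch that parses the "arp: ... moved from ... to ..." lines in the format shown in the question; the function name and the regular expression are my own, and the sample lines are copied from the question's log.

```python
import re
from collections import Counter

# Matches kernel messages such as:
#   arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0
ARP_MOVE = re.compile(
    r"arp: (?P<ip>\d+\.\d+\.\d+\.\d+) moved from "
    r"(?P<old>[0-9a-f:]{17}) to (?P<new>[0-9a-f:]{17}) on (?P<iface>\S+)"
)


def summarize_arp_moves(lines):
    """Return (move count per IP, set of MAC addresses seen per IP)."""
    moves = Counter()
    macs = {}
    for line in lines:
        match = ARP_MOVE.search(line)
        if match is None:
            continue  # not an arp "moved" message
        ip = match.group("ip")
        moves[ip] += 1
        macs.setdefault(ip, set()).update(
            (match.group("old"), match.group("new"))
        )
    return moves, macs


sample = [
    "May 23 05:00:00 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0",
    "May 23 05:00:02 ns228407 kernel: arp: 176.31.237.251 moved from 00:07:b4:00:00:01 to 00:25:90:02:08:fc on ix0",
    "May 23 05:32:44 ns228407 kernel: arp: 176.31.237.254 moved from 00:07:b4:00:00:03 to 00:07:b4:00:00:01 on ix0",
]

if __name__ == "__main__":
    moves, macs = summarize_arp_moves(sample)
    for ip, count in moves.most_common():
        print(ip, count, sorted(macs[ip]))
```

Running it over the whole of /var/log/messages and comparing the timestamps with the outages should show whether the moves line up with the crashes or are just background noise.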

This is one sick puppy - and you're going to have to start fixing one thing at a time until you get a clearer picture of what's going wrong.

Start by switching the battery, tuning the apache install and logging performance metrics.
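
For the metrics part, even something very simple is better than nothing. As a minimal sketch (assuming a Unix-like system where Python's os.getloadavg() is available, and with a made-up log format), this prints one timestamped load-average sample; run it from cron every minute and append the output to a file so you have data covering the next outage.

```python
import os
import time


def format_sample(timestamp: float, load: tuple) -> str:
    """Format one load-average sample as a single log line (UTC timestamp)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    one, five, fifteen = load
    return f"{stamp} load1={one:.2f} load5={five:.2f} load15={fifteen:.2f}"


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    print(format_sample(time.time(), os.getloadavg()))
```

Load average is only one signal; disk latency and the number of busy Apache workers are worth recording the same way once this is in place.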

Definitely start with the battery. The whole performance stack goes into safety mode with this warning and performance basically goes out of the window. And learn measuring - CPU below 50% including peaks can mean it is critically overloaded, depending on distribution over the day. — TomTom, May 23 '14 at 09:30
