
I have two Debian Linux 6.0.4 servers that behave strangely: after 5 to 10 days they hang. By this I mean they stop answering ping and have to be restarted.

I've been struggling with this problem for a couple of months now. Here are some thoughts and what I have tried so far, without being able to solve it.

  • I changed the RAM on one server. Since two different servers are affected, and a third identical server doesn't have this problem, I doubt it is hardware related.
  • I logged the server load, and when it crashes the load is fine (quite low).
  • I cannot find anything in the server logs; they look fine right up until the server freezes.
  • Unfortunately, I don't have console access.

While I have years of admin experience, I have never encountered such an issue, and right now I have no idea where else to investigate.

If you have an idea of what I could try in order to fix the problem, please share it with me :-)

nwalke
Alex Flo

3 Answers

1

Do the servers really hang or are they just unreachable by ping?

Install a monitoring tool such as Munin (or similar), which will show you graphs of not just the CPU load but also memory stats, disk usage, and various other things; you can configure it to monitor many aspects of the system. Next time the server hangs, check the graphs for any unusual signs. You will learn what a normal graph looks like, so anything out of the ordinary is suspicious (although not necessarily wrong).
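If installing a full monitoring tool is not an option right away, a small sampler can record the same kind of data. The following is only a minimal sketch, not something from the original answer: the log path, the 10-second interval, and all function names are assumptions. It reads the 1-minute load average and `/proc/meminfo`, and forces each line to disk so that the last sample before a hang has a chance of surviving.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(now, load1, meminfo):
    """Build one log line from a timestamp, the 1-minute load and a meminfo dict."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load1=%.2f memfree_kb=%d swapfree_kb=%d" % (
        stamp,
        load1,
        meminfo.get("MemFree", -1),   # -1 marks a field that was not found
        meminfo.get("SwapFree", -1),
    )


def sample_forever(path="/var/log/hang-sampler.log", interval=10):
    """Append one sample every `interval` seconds (path and interval are assumptions)."""
    with open(path, "a") as log:
        while True:
            with open("/proc/meminfo") as f:
                meminfo = parse_meminfo(f.read())
            line = format_sample(time.time(), os.getloadavg()[0], meminfo)
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a possible hang
            time.sleep(interval)
```

Calling `sample_forever()` from a script started at boot would leave a trail whose last lines show the state shortly before the freeze. It does not replace proper graphs; it only narrows down whether memory or load changed in the final minutes.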

Are you sure you are checking all the server logs? That is, do you run web, mail, FTP, DNS, or other services? Check all of their logs, and don't forget to enable debug logging while troubleshooting.

If the server crashes every week or so, it could be caused by something that runs regularly, e.g. a cron job or log rotation.

Make sure you get all system emails (root alias). You can also install OSSEC, which is a great tool for keeping an eye on the logs and getting emails when things go wrong. OSSEC is just an automated way of looking at the logs, though, so nothing magical.

Networking issues? DHCP lease expired?

snostorm
  • I believe the server hangs because, besides not responding to ping, no more messages are logged. – Alex Flo Jun 27 '12 at 07:10
  • I looked at (all) the logs and they look as if the server had been unplugged from the socket (though I don't believe that's the case :) – Alex Flo Jun 27 '12 at 07:12
  • **dhcp**: if that were the case I would have seen something in the logs, I believe. Besides, even if the server were unreachable because of a networking issue, my logging should have kept working, which it didn't. – Alex Flo Jun 27 '12 at 07:14
1

Please show the relevant content of /var/log/messages and/or /var/log/kern.log. It's possible the kernel logged a crash report or something else that could shed some light. When I experienced such unexplained hangs, it was due to a bad driver; because the logging isn't very verbose, I wasn't able to find out exactly which driver.

In my case there were soft lockups (kernel: [XXXX] BUG: soft lockup - CPU#X). After some research I found http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556030; the last comment there provided some insight and a way to make logging more verbose. It's an easy kernel modification, but if you don't feel comfortable compiling your own kernel it may not be the best thing to do.
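To check quickly whether the logs contain messages of the kind described above, something like the following sketch could be used. It assumes the log lines contain the text `BUG: soft lockup - CPU#<n>` as in the example message; the function names and the default path are illustrative only.

```python
import re

# Pattern for the message quoted above: "BUG: soft lockup - CPU#X"
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+)")


def find_soft_lockups(lines):
    """Return (line number, CPU number, text) for every soft lockup line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append((number, int(match.group(1)), line.rstrip("\n")))
    return hits


def scan_file(path="/var/log/kern.log"):
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as f:
        return find_soft_lockups(f)
```

Running `scan_file()` on kern.log (and on rotated copies, if any) shows whether a lockup was logged before a freeze. An empty result does not rule out a kernel problem, since a hard hang may leave no time to write anything.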

Just updating the kernel or installing a newer version and rebooting may fix the problem.

Quoting:

We extensively researched the problem.

The TLB flush softlockup is only a CONSEQUENCE of a deadlock.

Background: The TLB flush is issued by a CPU to a number of other CPUs using inter-processor interrupts to propagate paging changes. Then the issuing CPU loops until all processors acknowledge the change. If one of those processors is deadlocked on a spinlock, this never happens, and the softlockup triggers. The deadlock arises on a spinlock; this lock may sometimes be held by user code (through the /proc or /sys interfaces of modules).

The only way to identify the root cause (i.e. which driver is causing problems) is to dump ALL CPU stacks in the soft lockup code.

One way to do that is to modify the kernel and add

            arch_trigger_all_cpu_backtrace() 

in the

            kernel/softlockup.c:softlockup_tick() 

function.

This is based on NMI IPI, which ensures all stacks are dumped, even in the case of a deadlock (well, don't expect the impossible to happen either).

You should easily find the faulty driver and post the relevant bug.

aseq
  • Thanks for this suggestion, but I have a 3rd server with exactly the same HW and the very same Debian version (same kernel version as well), and that server doesn't have this issue, so I tend to believe it's not a kernel problem. – Alex Flo Jun 27 '12 at 07:19
  • Anyway, I somehow believe it could be related to some Python scripts that I run on the 2 servers that hang: they could exhaust the maximum number of file descriptors. Would that be possible? And if so, would ping stop responding? – Alex Flo Jun 27 '12 at 07:20
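One way to test the file descriptor theory is to compare a script's open descriptors against its limit, and the system-wide handle count against the system maximum. The sketch below reads the standard Linux files `/proc/<pid>/fd`, `/proc/<pid>/limits`, and `/proc/sys/fs/file-nr`; the function names are hypothetical, and the PID of the suspect script has to be supplied by the reader.

```python
import os


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: allocated handles, unused handles, system max."""
    allocated, unused, maximum = (int(x) for x in text.split()[:3])
    return allocated, unused, maximum


def parse_soft_nofile(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text, or None."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft = line[len(label):].split()[0]
            return None if soft == "unlimited" else int(soft)
    return None


def count_open_fds(pid):
    """Count the entries in /proc/<pid>/fd (needs permission to read them)."""
    return len(os.listdir("/proc/%s/fd" % pid))


def report(pid):
    """Summarize per-process and system-wide descriptor usage for one PID."""
    with open("/proc/%s/limits" % pid) as f:
        soft = parse_soft_nofile(f.read())
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, maximum = parse_file_nr(f.read())
    return "pid %s: %d fds open (soft limit %s); system: %d of %d handles" % (
        pid, count_open_fds(pid), soft, allocated, maximum)
```

Logging `report(pid)` periodically for each suspect script would show whether the counts climb steadily over the days before a hang. A steady climb would point to a leak in the scripts; it would not by itself explain why ping stops answering.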
0

Apparently the problem was related to some Python scripts that caused the server to hang. I don't understand why they hung the server, but at least they don't hang it any more.

Alex Flo