
Once or twice a day I find one of my servers powered off for no apparent reason.

Background and what I've tried so far:

  • Nothing unusual is reported under /var/log/. There is just normal server activity and then the startup logs from when I manually power the machine back on.
  • sensors always reports normal temperatures, and they stay normal throughout the days on which the problem occurs: http://pastebin.com/gk8JuPCK
  • Physically inspecting the PSU (Thermaltake) and the other parts of the tower reveals nothing worrying. The inside is clean (dust free) and all fans are working without problems.
  • In the BIOS settings there is an alert configured for when the CPU reaches 60°C, but that is very high. Also note that the setting is only an "alert"; there isn't a "turn off" setting like the one I remember from other BIOSes.
  • I've run memtest over all of the memory many times without a single error. I also don't think it's a memory problem, since I've never found the server halted or crashed, only powered off.
  • The server is connected to a UPS which also supplies other similar servers. The other servers have never had this problem. I've even swapped the power cables and UPS outputs between two servers, and the very same server had the problem again. So the UPS is not the cause.
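One more check that fits the list above is whether the machine logged a clean shutdown before each power-off. This is a minimal sketch, assuming the login records in /var/log/wtmp are intact and that `last -x` is available; the helper name `count_boot_records` is made up for illustration. It counts boot records and clean-shutdown records in `last -x` output.

```shell
# Count boot and clean-shutdown records in `last -x` style output read from stdin.
# count_boot_records is a hypothetical helper name, not a standard tool.
count_boot_records() {
    awk '
        $1 == "reboot"   { boots++ }   # written each time the system boots
        $1 == "shutdown" { clean++ }   # written only on an orderly shutdown
        END { printf "boots=%d clean_shutdowns=%d\n", boots, clean }
    '
}

# Usage on the affected server (last reads /var/log/wtmp by default):
#   last -x | count_boot_records
```

If there are noticeably more boots than clean shutdowns, the power was probably cut abruptly, which points toward hardware. If every boot is preceded by a shutdown record, something is asking the OS to shut down, and that request should be traceable.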

Where should I look next?

Server info:

AMD 64 Processor 3500+
2 x 512MB
It mainly runs SVN and DNS. No X sessions are running and no users are logged in.

cat /proc/version

Linux version 2.6.26-1-686 (Debian 2.6.26-13) (waldi@debian.org) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-24)) #1 SMP Sat Jan 10 18:29:31 UTC 2009
cherouvim

2 Answers


The only causes I can think of that you didn't mention are:

  • a wrong watchdog setting on your system (either at the BIOS/HW level or in the kernel/userspace),
  • a HW problem (I would bet on a malfunctioning power supply). I once had the same problem on a customer's HP tower server.
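To check the watchdog possibility from the running system, something like the following could serve as a starting point. It is only a sketch: the helper name `find_watchdog_modules` is made up, and the module-name patterns (wdt, wdog, softdog, watchdog) are a heuristic assumption rather than a complete list of watchdog drivers.

```shell
# Filter `lsmod`-style output on stdin for kernel modules that look watchdog-related.
# The name patterns are a heuristic assumption; a driver could be named differently.
find_watchdog_modules() {
    # NR > 1 skips the "Module  Size  Used by" header line.
    awk 'NR > 1 && tolower($1) ~ /(wdt|wdog|softdog|watchdog)/ { print $1 }'
}

# Usage on the affected server:
#   lsmod | find_watchdog_modules   # any watchdog driver loaded?
#   ls -l /dev/watchdog             # is there a watchdog device node?
```

An empty result only says that no module matched these names; a watchdog enabled in the BIOS would still need to be checked in the BIOS setup itself.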
Kamil Šrot

Try installing sysstat. sysstat is a tool which collects system data (e.g. CPU, RAM and I/O usage) at regular intervals. Its output is also a valuable source of information when troubleshooting crashes. Consider installing the sysstat package and enabling its service with:

chkconfig boot.sysstat on
/etc/init.d/boot.sysstat start
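The server in the question runs Debian, where the chkconfig and boot.sysstat names above may not exist. A rough equivalent is sketched here, assuming Debian's sysstat package keeps its on/off switch as an ENABLED line in /etc/default/sysstat and writes daily data files under /var/log/sysstat/. The helper name `enable_sysstat_collection` is made up for illustration.

```shell
# Flip ENABLED="false" to ENABLED="true" in a Debian-style sysstat defaults file.
# The file layout is an assumption about Debian's packaging; check the file first.
enable_sysstat_collection() {
    sed -i 's/^ENABLED="false"/ENABLED="true"/' "$1"
}

# Usage on the affected server (as root):
#   apt-get install sysstat
#   enable_sysstat_collection /etc/default/sysstat
#   /etc/init.d/sysstat restart
#
# After the next unexplained power-off, inspect that day's data (DD = day of month):
#   sar -u -f /var/log/sysstat/saDD   # CPU usage leading up to the power-off
#   sar -r -f /var/log/sysstat/saDD   # memory usage
```

The last samples before the gap in the data show what the machine was doing shortly before it went down, at the resolution of the collection interval.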

infaustus