Why does my server accidentally go down?

Question

I have a CentOS 5.3 based server with kernel 2.6.18-128.2.1.el5. It worked fine for nearly a month, but this week it went down three times. Each time I saw the outage in Nagios and wrote an email asking for the server to be rebooted. It then ran for 12-36 hours and went down again.

I looked through the log files. Just before the first fault, /var/log/messages contained this message:

logrotate: ALERT exited abnormally with [1]

After the second reboot, the sysadmin from the datacenter sent me this screenshot: http://www.freeimagehosting.net/uploads/bd9fb68d98.png

Before the third fault, /var/log/messages contained this message:

Eeek! page_mapcount(page) went negative (-1)

How should I investigate the problem?

UPD:

Part of the memtester output:

Compare OR          : FAILURE: 0x7e9f90d1 != 0x7e9fd2d1 at offset 0x06222609.
FAILURE: 0x7e9f90d1 != 0x7e9fd0d1 at offset 0x06222621.
FAILURE: 0x7e9f90d1 != 0x7e9fd1d1 at offset 0x06222661.
FAILURE: 0x7e9f90d1 != 0x7e9f92d1 at offset 0x06222681.
FAILURE: 0x7e9f90d1 != 0x7e9fd0d1 at offset 0x062226a1.
FAILURE: 0x7e9f90d1 != 0x7e9fd0d1 at offset 0x062226c1.
FAILURE: 0x7e9f90d1 != 0x7e9f93d1 at offset 0x062226e9.

It is faulty memory. Thank you for the help!
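To see which bits differ in each memtester FAILURE line, the two values can be XORed. The following is a minimal Python sketch, not part of memtester itself; the function names `flipped_bits` and `bit_histogram` are made up for illustration, and the regular expression assumes the line format shown in the output above.

```python
import re
from collections import Counter

# Assumed line format, taken from the output quoted above:
#   FAILURE: 0x7e9f90d1 != 0x7e9fd2d1 at offset 0x06222609.
FAILURE_RE = re.compile(
    r"FAILURE: 0x([0-9a-fA-F]+) != 0x([0-9a-fA-F]+) at offset 0x([0-9a-fA-F]+)"
)


def flipped_bits(line):
    """Return (offset, [bit positions that differ]) or None if no match.

    Bit 0 is the least significant bit.
    """
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    first, second, offset = (int(group, 16) for group in match.groups())
    diff = first ^ second  # a set bit marks a position where the values disagree
    bits = [i for i in range(diff.bit_length()) if (diff >> i) & 1]
    return offset, bits


def bit_histogram(text):
    """Count how often each bit position differs across all FAILURE lines."""
    counts = Counter()
    for line in text.splitlines():
        parsed = flipped_bits(line)
        if parsed is not None:
            counts.update(parsed[1])
    return counts


SAMPLE = """\
Compare OR          : FAILURE: 0x7e9f90d1 != 0x7e9fd2d1 at offset 0x06222609.
FAILURE: 0x7e9f90d1 != 0x7e9fd0d1 at offset 0x06222621.
FAILURE: 0x7e9f90d1 != 0x7e9fd1d1 at offset 0x06222661.
FAILURE: 0x7e9f90d1 != 0x7e9f92d1 at offset 0x06222681.
FAILURE: 0x7e9f90d1 != 0x7e9fd0d1 at offset 0x062226a1.
FAILURE: 0x7e9f90d1 != 0x7e9fd0d1 at offset 0x062226c1.
FAILURE: 0x7e9f90d1 != 0x7e9f93d1 at offset 0x062226e9.
"""

if __name__ == "__main__":
    for bit, count in sorted(bit_histogram(SAMPLE).items()):
        print(f"bit {bit}: differs in {count} line(s)")
```

On the seven lines quoted above, the values differ only in bits 8, 9 and 14. Differences confined to a few bit positions at nearby offsets are at least consistent with a hardware fault rather than a software bug, though this script alone cannot prove that.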

score 3 · Accepted Answer · answered Jul 30 '09 at 12:30

My first guess is that Nagios has a small memory leak and after months of running ran out of RAM or swap. However, since the machine has crashed a few times in the same day, that suggests a faulty RAM chip. My first step would be to do a memory test or check the bad memory log (if your server supports it).

answered Jul 30 '09 at 12:30

TomOnTime

How can I perform this test? Server is in datacenter. – lexsys Jul 30 '09 at 12:38
You would need to get the DC staff to run a memory test. You could try a userspace memory tester like http://pyropus.ca/software/memtester/ but a clean-boot-with-memtest86+-or-similar is what you really want and you'll not be able to do that yourself remotely (unless it is a boot option on the machine and you have KVM-over-IP access). – David Spillett Jul 30 '09 at 13:04

score 2 · Answer 2 · answered Jul 30 '09 at 12:37

I vote faulty RAM too. I would recommend using memtest86 to do a thorough check of the RAM. Also, are the temperatures in the room nice and cool?

answered Jul 30 '09 at 12:37

Kyle Brandt

score 1 · Answer 3 · answered Jul 30 '09 at 12:47

I vote faulty RAM too. If you cannot use memtest86 because the machine is remotely located, you may want to try a userspace tool, memtester, instead. It doesn't work quite as well, because it can only test memory the running system lets it lock, but it may be able to pick up some memory errors if they are there.
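As a rough illustration of running memtester remotely, here is a small Python wrapper. It is a sketch under assumptions: memtester must already be installed and on the PATH, it usually needs root to lock the memory it tests, the 512 MB size and single loop are arbitrary example values, and the exit-code meanings reflect my reading of the memtester man page and should be checked against your version. The helper names are hypothetical.

```python
import subprocess

# Exit-code bits as I understand them from the memtester man page (assumption).
EXIT_FLAGS = {
    0x01: "error allocating or locking memory, or invocation error",
    0x02: "error during stuck address test",
    0x04: "error during one of the other tests",
}


def build_memtester_cmd(size_mb, loops=1):
    """Build the argument list for: memtester <size>M <loops>."""
    return ["memtester", f"{size_mb}M", str(loops)]


def describe_exit(code):
    """Translate a memtester exit code into human-readable messages."""
    if code == 0:
        return ["no errors detected"]
    return [text for flag, text in EXIT_FLAGS.items() if code & flag]


def run_memtester(size_mb, loops=1):
    """Run memtester and return (exit code, captured stdout)."""
    result = subprocess.run(
        build_memtester_cmd(size_mb, loops),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Example values only: test 512 MB for one pass.
    code, output = run_memtester(512, 1)
    print(output)
    for message in describe_exit(code):
        print("memtester:", message)
```

Testing as much free memory as the system can spare gives better coverage, but a clean pass from a userspace tester still does not rule out faulty RAM.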

answered Jul 30 '09 at 12:47

sybreon

score 0 · Answer 4 · answered Jul 30 '09 at 12:26

At a quick glance, it looks like the process that panicked was Nagios. Has that been consistent every time it has panicked and locked up? If so, I would ask whether the problems started around the time you set up Nagios. If that's the case, you might want to try shutting Nagios down and see if the server becomes stable again. If it does, then you have found the culprit and need to look closer to see what's wrong with Nagios.

answered Jul 30 '09 at 12:26

Jeremy Bouse

Nagios is a userspace process. It ain't going to panic the kernel – goo Jul 30 '09 at 12:29
After the second crash I turned nagios off. It didn't help. – lexsys Jul 30 '09 at 12:31

score 0 · Answer 5 · answered Jul 30 '09 at 12:36

Google or the CentOS forums/mailing list are likely to be your best bet. Without a crash dump it's going to be difficult to be sure, so you should look into getting that configured.

You can also search through the Red Hat Bugzilla. This looks like a possibility, based on the little you have from the screenshot.
