0

I'm running CentOS 5.5 on a server. It runs several VMware virtual machines and an NFS server.

Occasionally, like today, it hangs. There's nothing in /var/log/messages that indicates any problem. (I did notice that /var/log/messages is not in time order.)

Any suggestions where to look for the cause?

Michael Eager
  • 121
  • 1
  • 4
  • What do you mean by "/var/log/messages is not in time order"? The log entries should be (must be?) chronological. There may be some shifts if you are running ntp but those should be small and infrequent. – uesp Mar 08 '11 at 17:23
  • By "hangs" what do you mean? Does it become unresponsive for a period of time? Does the machine go back to normal after this period of time? – xeon Mar 08 '11 at 19:21
  • Looks like log out of order was caused by the system clock being off. It reset during the reboot. Unresponsive -- no video, no response to keyboard, ping, ssh. Does not recover. Fixed by reboot. Not CPU/memory/network bound. – Michael Eager Mar 11 '11 at 00:32
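
The out-of-order entries discussed in the comments above can be located with a short script. This is only a sketch, assuming the traditional syslog prefix (`Mar  8 07:01:02`) and an English locale for month names. The function name `find_out_of_order` and the `year` parameter are made up for illustration; the year is needed because that timestamp format does not include one. It assumes Python 3, which is not the stock interpreter on CentOS 5, so it may be easiest to run it on a copy of the log on another machine.

```python
from datetime import datetime


def find_out_of_order(lines, year=2011):
    """Return (line_number, previous_ts, current_ts) for each backwards time jump.

    Hypothetical helper: assumes each line starts with a 15-character
    traditional syslog timestamp such as "Mar  8 07:01:02".
    """
    jumps = []
    prev = None
    for number, line in enumerate(lines, start=1):
        stamp = line[:15]
        try:
            ts = datetime.strptime("%d %s" % (year, stamp), "%Y %b %d %H:%M:%S")
        except ValueError:
            # Not a timestamped line (continuation, garbage, blank): skip it.
            continue
        if prev is not None and ts < prev:
            jumps.append((number, prev, ts))
        prev = ts
    return jumps


# Example use (path is an assumption, adjust as needed):
#   with open("/var/log/messages") as fh:
#       for number, before, after in find_out_of_order(fh):
#           print(number, before, "->", after)
```

A backwards jump near a reboot would be consistent with the clock being reset at boot, as described above; jumps elsewhere would point to something else adjusting the clock.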

5 Answers

2

That's not a lot of information to diagnose by. If the system genuinely hangs - that is, becomes unresponsive both on the network and at the local console - and there's nothing in syslog or dmesg to provide a clue, then I would presume you have hit a hardware fault and would start by running the diagnostic tools from your hardware vendor. Bad RAM or a bad CPU could certainly cause this type of behaviour.

Jeff Albert
  • 1,987
  • 9
  • 14
  • Yes, likely hw problem. The question is which component. I'm looking for diagnostics to set up so that I'll get a clue next time it crashes. – Michael Eager Mar 11 '11 at 00:34
  • Well, a genuine hardware fault like an intermittent failing CPU isn't gonna give you much in the way of logs down at the OS level. You need to run whatever diagnostic tools your hardware vendor provides to audit the system and determine what's broken. If you have warranty support, now is probably the time to call it in. – Jeff Albert Mar 11 '11 at 18:01
2

The problem was RAM failures. I ran memtest86, which reported errors, so I RMA'ed the DIMMs and got new ones. Some of the replacements also had memory failures, so I RMA'ed those as well. Now everything is stable.

Michael Eager
  • 121
  • 1
  • 4
0

If you are running a desktop environment (GNOME or KDE), I have seen machines hard lock under it.

In my case the GNOME screensaver was the trigger: the machine would lock up completely and stop responding to any connection. After I disabled the screensaver, the lockups stopped.

Take a look at the Xorg logs and the GDM logs (if you are using GNOME).

Also, check the timestamps on all of your log files in /var/log and see whether any of them were being written to at the time of the lockup.
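
As a rough sketch of that check, the snippet below lists files under a log directory whose last-modified time falls within a window around the suspected lockup time. The function name `logs_written_near` and the 30-minute default are assumptions made for illustration, and it assumes Python 3. Keep in mind that a file's modification time only records the last write, so logs that are written again during the reboot will not show up as hits.

```python
import os
from datetime import datetime, timedelta


def logs_written_near(log_dir, when, window_minutes=30):
    """Return (path, mtime) pairs for files last modified close to `when`.

    Hypothetical helper: walks `log_dir` recursively and keeps files whose
    modification time is within `window_minutes` of the given datetime.
    """
    window = timedelta(minutes=window_minutes)
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtime = datetime.fromtimestamp(os.path.getmtime(path))
            except OSError:
                # File vanished or is unreadable: ignore it.
                continue
            if abs(mtime - when) <= window:
                hits.append((path, mtime))
    # Oldest first, so the last file touched before the hang is at the end.
    return sorted(hits, key=lambda item: item[1])


# Example use (time taken from the cron hint in this thread, purely illustrative):
#   for path, mtime in logs_written_near("/var/log", datetime(2011, 3, 8, 7, 0)):
#       print(mtime, path)
```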

Have you looked at cron? Could a process be running automagically and causing the lockup?

Mike
  • 802
  • 4
  • 5
  • Nothing in Xorg logs, not using Gnome. Cron log suggests the crash was after 07:00. Not sure what that tells me. Only normal cron.hourly being run. – Michael Eager Mar 11 '11 at 00:36
  • Definitely time to start using the diagnostic tools available from the vendor. I suspect that this could be a hardware issue. If you can push the services of the machine to another, you could then do some real testing on the components. – Mike Mar 11 '11 at 21:34
0

Not necessarily: syslog has the ability to write log messages asynchronously, so entries need not land in the file in strict time order. Also look at sar output to find out what is happening around the hang. It could be I/O waits, or the machine could be network bound, memory bound, or CPU bound.

Sar Tutorial
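
If sar data is not being collected, a crude stand-in is a heartbeat script that appends a line of load and memory figures to a file every few seconds and forces it to disk, so the last line written before a hang survives the reboot. This is not sar, only a minimal sketch: the function names, the `/var/tmp/heartbeat.log` path and the 10-second interval are all assumptions, and it assumes Python 3 and the Linux `/proc/loadavg` and `/proc/meminfo` files.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value_in_kB} dict."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(loadavg_text, meminfo_text, now):
    """Build one heartbeat line from the raw /proc contents."""
    load1 = loadavg_text.split()[0]  # 1-minute load average
    mem = parse_meminfo(meminfo_text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load1=%s MemFree=%skB SwapFree=%skB" % (
        stamp, load1, mem.get("MemFree", "?"), mem.get("SwapFree", "?"))


def append_sample(path, line):
    """Append a line and fsync it so it is on disk before the next sample."""
    with open(path, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def read_text(path):
    with open(path) as fh:
        return fh.read()


def run(path="/var/tmp/heartbeat.log", interval=10):
    """Loop forever; the path and interval are illustrative defaults."""
    while True:
        line = format_sample(read_text("/proc/loadavg"),
                             read_text("/proc/meminfo"), time.time())
        append_sample(path, line)
        time.sleep(interval)
```

After a hang, the timestamp of the last line narrows down when the machine stopped, and the figures just before it hint at whether load or memory was climbing. A sudden stop with unremarkable numbers would point back toward hardware.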

Ben Lutgens
  • 351
  • 1
  • 4
0

You may just need to completely clean your motherboard. This exact same thing was happening to me: a complete freeze, with no response to mouse or keyboard, just a frozen screen and a hung CPU, completely unresponsive. The logs showed nothing.

I did a complete cleaning, which included disconnecting everything and taking out the motherboard, followed by a very, very careful cleaning. Taking off the CPU heat sink, which was attached to the internal fan, meant I had to re-seat the heat sink on top of the CPU using thermal paste (Arctic Silver 5, which I purchased at my local RadioShack).

I also used rubbing alcohol (91%) to clean the old thermal paste off the CPU and heat sink.

I had downloaded instructions from both Intel and Arctic Silver.

The surfaces have to be very, very clean, and the instructions are very specific.

I put it all back together per the instructions I downloaded, and it ran fine.

It saved me from throwing out the PC, thinking something was physically wrong with it, when it was just dirty and dusty. Underneath the fan shroud the motherboard was really a gunky mess. I suspect this was causing some short-circuiting, as built-up dust and grime can be electrically conductive.

John
  • 1