Quick Description:

I have recently started trying to set up and manage a Linux (Ubuntu 10.04.2 LTS) server in our data center (all other servers are Windows boxes). The server periodically hangs and becomes unresponsive, and I'm at a loss to find anything in any log that indicates a specific cause. Sometimes it's up for hours, sometimes days (14 days at the longest). Plugging a monitor into the machine after a hang shows nothing at all. In an effort to troubleshoot the problem we've tried disabling APIC, more out of "educated desperation" than anything else. Unfortunately we are limited in some of the troubleshooting we can do, as we have a single client website hosted on the box (the reason we set it up), so anything that involves significant downtime is a problem.

As this is our first attempt at setting up a Linux box, we are using a "well equipped" desktop-grade machine, not what I would call "server grade" hardware. This is a standalone box, not a VPS. We are using a hardware, not software, RAID array and have plenty of memory in the box.

Caveats / Background:

  • I am relatively new to Linux in general.
  • I spend much more time writing code than managing servers. I'm comfortable with working on the box, but I'm not really a sysadmin guy.
  • I'm comfortable with the command line but have more experience with OS X (BSD).
  • I am unsure of all of the tools, information, and logs that may be available, though I try to be thorough in checking what I do know.
  • I did not physically configure the hardware so I'm not sure of all of the specs but I can get any info I need to troubleshoot.
  • I may be skipping very basic steps or missing obvious places to look for information without knowing it.

A little more detail:

  • Real memory: 8GB
  • Ubuntu 10.04.2 LTS
  • Hardware RAID 10
  • Managing sites with Webmin version 1.550
  • Server is in a remote data center. Hands-on troubleshooting is difficult.

We have attempted two Linux setups at this point. The first was on a hardware config identical to this one, but with no actual pieces of hardware reused. That attempt was using CentOS and we were attempting to set up CPanel. We scrapped that install because of this same problem (periodic crashing / hanging).

The second attempt (this one) is showing the same behavior. The only thing I can really see in common is the hardware configuration (though CentOS and Ubuntu may have more in common than I think).

The box will run fine for hours, days, or even weeks, and then just stop responding entirely. I check all of the logs I know to check (primarily messages, syslog and kern.log) but I don't see anything that seems like an error to me. I do see lines that I don't understand that may or may not be problems, such as:

rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="814" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.

Most of our syslog entries seem to be logs of Webmin-related cron jobs running. My gut tells me that there is possibly some component in our configuration that Linux does not like or that needs a driver update (maybe the RAID card, for example), but I'm unsure of how to do more to track down or determine what that might be. Guess and check is expensive.

Another thought I've had is that one or more of the cron jobs that are running are tripping something up, but it doesn't appear to be reproducible on demand and, again, I'm at a loss on how to test that theory any further. The same cron job does not appear to be running each time the server goes down.

This is a portion of the log just prior to our last hang:

Aug  8 11:00:01 linhost01 CRON[10771]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
Aug  8 11:00:01 linhost01 CRON[10772]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:01:01 linhost01 CRON[10799]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:05:01 linhost01 CRON[10898]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:06:01 linhost01 CRON[10924]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:09:01 linhost01 CRON[11007]: (root) CMD (  [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
Aug  8 11:10:01 linhost01 CRON[11023]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
Aug  8 11:10:01 linhost01 CRON[11024]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:11:01 linhost01 CRON[11063]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:15:01 linhost01 CRON[11149]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:16:01 linhost01 CRON[11176]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:17:01 linhost01 CRON[11243]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Aug  8 11:20:01 linhost01 CRON[11279]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
Aug  8 11:20:01 linhost01 CRON[11280]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:21:01 linhost01 CRON[11307]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug  8 11:25:01 linhost01 CRON[11392]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug  8 11:26:01 linhost01 CRON[11432]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
[SERVER DOWN AFTER THIS POINT]

If anyone can help shed any light or even give me anything else I can post here that might be helpful I would be very appreciative. I'm all for jumping in to learn by doing, but I'm starting to reach the end of my rope on this one.

Happy to post any specific log info or information that might be helpful in offering any suggestions.

Cliff Pruitt
  • Can you try to reproduce the error and, in the meantime, run tail -f /var/log/syslog | tee error.log? You will get output on the screen and it will also save a copy to error.log. Provide the window output and, when you reboot, the output of error.log :) Also, are there any warnings on boot about hardware not being recognised? Can you also provide a detailed overview of your hardware? – Lucas Kauffman Aug 08 '11 at 18:08
  • Well, the error is completely non-reproducible when I want it. I can use the box all day with no problems and it can run fine for hours, days, or even weeks. That said, I've had two crashes in the last 24 hours or so and can post my syslog with both crashes and reboots. – Cliff Pruitt Aug 09 '11 at 16:50
  • http://www.crayoncowboy.com/download/syslog_dconf.txt That file will show both my syslog including the crashes / reboots as well as a detailed dconf description of my hardware config. If there is any useful info that dconf does not include, I'll be happy to track it down. – Cliff Pruitt Aug 09 '11 at 16:57
  • I'll take a look at it tomorrow morning – Lucas Kauffman Aug 09 '11 at 21:00
  • After several weeks we're still having problems. I've started logging info once per minute. Just before the last crash we [logged the info seen here](http://www.crayoncowboy.com/download/ubuntu.info.precrash.txt) but I see nothing there that indicates a problem. Plenty of memory, no runaway processes, temperature is fine... If anyone cares to take a look and can let me know if they see anything crazy I'm missing I'd really appreciate it. I'm kind of out of ideas at this point. ([a full log can be found here but it is a large file](http://www.crayoncowboy.com/download/ubuntu.info.txt)) – Cliff Pruitt Sep 10 '11 at 21:39
  • You mention that you had the same problem with an entirely different, unrelated (CentOS) distro. This definitely smells of a hardware problem. The most common hardware problem by far is bad RAM (particularly if you're not using ECC memory). I advise a quick run of Memtest86+, for 8GB it will take about an hour. – wazoox Feb 25 '12 at 19:16

3 Answers

2

For the sake of completeness I figured I'd just wrap this up. After well over a year of this behavior the server finally died. Only when trying (and failing) to rebuild the RAID did we find that not one, but two hard drives were bad.

The definitive cause of this server's problems still is not known, but (with my still somewhat limited understanding of Linux) it is my suspicion that these two drives had problems for some time and trying to use the bad drives was intermittently causing the server to crash / reboot.

Our final solution was to rebuild the server from scratch using virtually the exact same configuration but with all new hardware. The only significant configuration change we made was using ext4 instead of xfs for the file system. The box has been up for several months now without issue.

I'm answering this question only because, for us, it seems like drive failure was the cause and replacing all of the hardware was the best fix for the problem. That said, I don't know that this answer will be too helpful to most people.

Cliff Pruitt
1

I would post this as a comment, but I lack the reputation.

That said, the only thing that stands out from a casual review of your logs is nouveau. If it were me, I would disable nouveau. These instructions should get you there.
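For reference, one common way to disable nouveau on Ubuntu is to blacklist the kernel module. This is only a sketch of that general approach and not necessarily what the linked instructions describe; the function name and the file name `blacklist-nouveau.conf` are my own choices for illustration.

```shell
#!/bin/sh
# Sketch: write a modprobe blacklist file for the nouveau module.
# Assumptions: module configuration is read from /etc/modprobe.d,
# and the file name blacklist-nouveau.conf is arbitrary (hypothetical).

write_nouveau_blacklist() {
    # $1 = directory to write into (normally /etc/modprobe.d)
    conf_dir=$1
    printf '%s\n' \
        'blacklist nouveau' \
        'options nouveau modeset=0' \
        > "$conf_dir/blacklist-nouveau.conf"
}

# Typical use, as root:
#   write_nouveau_blacklist /etc/modprobe.d
#   update-initramfs -u    # rebuild the initramfs so the change applies at boot
#   reboot
```

After the reboot, `lsmod | grep nouveau` printing nothing would suggest the module is no longer loaded.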

You may also find the following Server Fault posts relevant, and perhaps helpful, if for no other reason than to expose you to some of the tools available to help troubleshoot.

How can I diagnose an Ubuntu system freeze after reboot

(How) can I use syslog to diagnose mysterious crashes?

Ubuntu 10.10 Maverick Server makes system locks up at random intervals (i7 930; 12GB RAM)

Good luck!

Marty
  • Thanks very much. I'll do some more reading on the posts you referenced and look into nouveau to see if it's something we should disable. As I understand it, it's just related to graphics processing, which we don't need for a headless server. – Cliff Pruitt Aug 18 '11 at 20:36
  • Yes, [nouveau](http://nouveau.freedesktop.org/wiki/FrontPage) is an open source (3D) driver for nVidia cards. Possibly useful to you as well, the nouveau wiki has a page dedicated to help [diagnose system hangs](http://nouveau.freedesktop.org/wiki/HangDiagnosis). – Marty Aug 18 '11 at 20:52
0

It is unclear what exactly is meant by "down". I know you mentioned it's remote, so hands-on access is difficult. However, for crashes like these it is critical to know whether the machine is entirely frozen. When it crashes, does the console still work (by work I mean: can you hit enter, does it display a password prompt, can you log in)? For a machine in the data center it is a really good idea to get some sort of console into it. Here is the cheap option:

http://international.opengear.com/SD4001_Single_Port_Advanced_Device_Server_p/sd4001.htm

This requires some setup to configure the serial console part. An easier but more expensive solution would be a KVM. Once you have determined whether the physical console is frozen during these outages, it should help determine the next steps.

If the physical console is also frozen, there is most likely an issue with your hardware. If the box doesn't already have ECC memory, you should look at testing or replacing the memory. It is unlikely that the console would lock up if the issue were just a driver error with a subsystem like the RAID card.

If the console does respond during these outages and you can log in, you should try to run some commands. If the problem is fairly frequent, you may just want to set up a cron job that captures the output of these every minute:

lsof -n # will list all open FDs on the system hopefully showing if something is using all the resources

netstat -an ; netstat -s # any network caused problems should show up here like running out of buffers

ps -eaf # general process pileup?

Date-stamp the output and then try to find the last one before the crash. If it is an issue with a subsystem, it should be apparent from the output here.
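A minimal sketch of such a capture script follows. The function name, the default directory `/var/log/crash-snapshots`, and the file naming scheme are assumptions chosen for illustration; only the three commands come from the list above.

```shell
#!/bin/sh
# Sketch: write one date-stamped snapshot of lsof/netstat/ps output per run.
# Assumptions: the default directory and file naming below are hypothetical.

capture_snapshot() {
    # $1 = directory that collects the snapshot files
    dir=$1
    stamp=$(date +%Y%m%d-%H%M%S)
    out="$dir/snapshot-$stamp.log"
    mkdir -p "$dir" || return 1
    {
        echo "=== snapshot taken $(date) ==="
        for cmd in "lsof -n" "netstat -an" "netstat -s" "ps -eaf"; do
            echo "--- $cmd ---"
            # Unquoted on purpose so the string splits into command and flags.
            # Keep going if a tool is missing or fails.
            $cmd 2>&1 || echo "(command failed: $cmd)"
        done
    } > "$out"
    # Flush to disk so the last snapshot has a chance of surviving a hard hang.
    sync
    echo "$out"
}

# Print nothing on success so cron only sends mail when something goes wrong.
capture_snapshot "${1:-/var/log/crash-snapshots}" >/dev/null \
    || echo "snapshot failed" >&2
```

Saved as, say, /usr/local/sbin/crash-snapshot.sh (a hypothetical path) and made executable, a root crontab line such as `* * * * * /usr/local/sbin/crash-snapshot.sh` would run it every minute. The files accumulate quickly, so old ones need pruning. After a crash, the snapshot with the latest timestamp before the reboot is the one to read.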

polynomial
  • Thanks very much for the suggestions. I'll take a look at the hardware. From the one instance that the box went down with someone on site, the report I have is that the server was 100% locked up. It was unresponsive to the keyboard entirely. I do not know what was on the screen (if anything) once a monitor was plugged in. – Cliff Pruitt Aug 18 '11 at 20:32
  • Oops, not used to pressing enter submitting my comment... Looks like my next step, in the absence of any console access, is probably going to be setting up a high-frequency cron job to capture some stats. Maybe that will give me an idea of any problems that are creeping up just before the crash. My biggest weakness is not knowing Linux well enough to know what can bring the box down and what Linux will correct or tolerate. – Cliff Pruitt Aug 18 '11 at 20:35