Quick Description:
I have recently started setting up and managing a Linux (Ubuntu 10.04.2 LTS) server in our data center (all other servers are Windows boxes). The server periodically hangs and becomes unresponsive, and I'm at a loss to find anything in any log that indicates a specific cause. Sometimes it's up for hours, sometimes days (14 days at the longest). Plugging a monitor into the machine after a hang shows nothing at all. In an effort to troubleshoot the problem we've tried disabling APIC, more out of "educated desperation" than anything else. Unfortunately we are limited in some of the troubleshooting we can do, as we have a single client website hosted on the box (the reason we set it up), so anything that involves significant downtime is a problem.
As this is our first attempt at setting up a Linux box, we are using a "well equipped" desktop-grade machine, not what I would call "server grade" hardware. This is a standalone box, not a VPS. We are using a hardware, not software, RAID array and have plenty of memory in the box.
Caveats / Background:
- I am relatively new to Linux in general.
- I spend much more time writing code than managing servers. I'm comfortable with working on the box, but I'm not really a sysadmin guy.
- I'm comfortable with the command line but have more experience with OS X (BSD).
- I am unsure of all of the tools / information / logs that may be available, though I try to be thorough in checking what I do know.
- I did not physically configure the hardware so I'm not sure of all of the specs but I can get any info I need to troubleshoot.
- I may be skipping very basic steps or missing obvious places to look for information without knowing it.
A little more detail:
- Real memory: 8GB
- Ubuntu 10.04.2 LTS
- Hardware RAID 10
- Managing sites with Webmin version 1.550
- Server is in a remote data center. Hands-on troubleshooting is difficult.
We have attempted two Linux setups at this point. The first was on a hardware config identical to this one, but with no actual pieces of hardware reused. That attempt was using CentOS and we were attempting to set up cPanel. We scrapped that install because of this same problem (periodic crashing / hanging).
The second attempt (this one) is showing the same behavior. The only thing I can really see in common is the hardware configuration (though CentOS & Ubuntu may have more in common than I think).
The box will run fine for hours, days, or even weeks, and then just stop responding entirely. I check all of the logs I know to check (primarily messages, syslog and kern.log) but I don't see anything that seems like an error to me. I do see lines that I don't understand that may or may not be problems, such as:
rsyslogd: [origin software="rsyslogd" swVersion="4.2.0" x-pid="814" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.
Most of our syslog entries seem to be logs of Webmin-related cron jobs running. My gut tells me that there is possibly some component in our configuration that Linux does not like or that needs a driver update (maybe the RAID card, for example), but I'm unsure of how to track down or determine what that might be. Guess and check is expensive.
Another thought I've had is that one or more of the running cron jobs is tripping something up, but it doesn't appear to be reproducible on demand and, again, I'm at a loss on how to test that theory any further. The same cron job does not appear to be running each time the server goes down.
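One way I could test the cron theory a bit more systematically might be to tally which cron commands ran in the last few minutes before each hang and compare the tallies across hangs. Below is a minimal Python sketch of that idea, not something I am running in production. The function names (parse_cron_line, commands_before) are hypothetical, the year has to be supplied by hand because syslog timestamps omit it, and the month parsing assumes an English locale.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Matches syslog CRON lines such as:
#   Aug  8 11:25:01 linhost01 CRON[11392]: (root) CMD (/etc/webmin/status/monitor.pl)
CRON_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+CRON\[\d+\]:\s+"
    r"\((?P<user>[^)]+)\)\s+CMD\s+\((?P<cmd>.*)\)\s*$"
)


def parse_cron_line(line, year):
    """Return (timestamp, user, command) for a CRON line, or None otherwise."""
    match = CRON_LINE.match(line)
    if match is None:
        return None
    # Collapse the padding syslog puts before single-digit days ("Aug  8").
    stamp = " ".join(match.group("stamp").split())
    # Assumption: the caller knows the year; syslog does not record it.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, match.group("user"), match.group("cmd").strip()


def commands_before(lines, hang_time, window_minutes, year):
    """Count (user, command) pairs logged in the window ending at hang_time."""
    start = hang_time - timedelta(minutes=window_minutes)
    counts = Counter()
    for line in lines:
        parsed = parse_cron_line(line, year)
        if parsed is None:
            continue
        when, user, cmd = parsed
        if start <= when <= hang_time:
            counts[(user, cmd)] += 1
    return counts


# Example usage (the year and the hang time are assumptions for illustration):
#   with open("/var/log/syslog") as log:
#       tally = commands_before(log, datetime(2011, 8, 8, 11, 26, 1), 10, 2011)
#   for (user, cmd), n in tally.most_common():
#       print(n, user, cmd)
```

If the same command showed up in the window before every hang, that would at least narrow the search. If the tallies differ from hang to hang, as I suspect, the cron jobs are probably not the trigger.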
This is a portion of the log just prior to our last hang:
Aug 8 11:00:01 linhost01 CRON[10771]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
Aug 8 11:00:01 linhost01 CRON[10772]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug 8 11:01:01 linhost01 CRON[10799]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug 8 11:05:01 linhost01 CRON[10898]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug 8 11:06:01 linhost01 CRON[10924]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug 8 11:09:01 linhost01 CRON[11007]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -type f -cmin +$(/usr/lib/php5/maxlifetime) -print0 | xargs -n 200 -r -0 rm)
Aug 8 11:10:01 linhost01 CRON[11023]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
Aug 8 11:10:01 linhost01 CRON[11024]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug 8 11:11:01 linhost01 CRON[11063]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug 8 11:15:01 linhost01 CRON[11149]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug 8 11:16:01 linhost01 CRON[11176]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug 8 11:17:01 linhost01 CRON[11243]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 8 11:20:01 linhost01 CRON[11279]: (www-data) CMD ([ -x /usr/lib/cgi-bin/awstats.pl -a -f /etc/awstats/awstats.conf -a -r /var/log/apache2/access.log ] && /usr/lib/cgi-bin/awstats.pl -config=awstats -update >/dev/null)
Aug 8 11:20:01 linhost01 CRON[11280]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug 8 11:21:01 linhost01 CRON[11307]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
Aug 8 11:25:01 linhost01 CRON[11392]: (root) CMD (/etc/webmin/status/monitor.pl)
Aug 8 11:26:01 linhost01 CRON[11432]: (root) CMD (/etc/webmin/virtual-server/collectinfo.pl)
[SERVER DOWN AFTER THIS POINT]
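Since monitor.pl is logged every five minutes in the excerpt above, any silence in the log much longer than that probably marks a hang. Rather than eyeballing the file, I could scan it for such gaps to get the exact last-entry time for every hang. This is only a rough Python sketch under some assumptions: the function names (parse_stamp, find_gaps) and the 10-minute threshold are my own choices, the year must be supplied because syslog omits it, and the month parsing assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Leading syslog timestamp, e.g. "Aug  8 11:26:01 " (month, day, time).
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")


def parse_stamp(line, year):
    """Return the line's timestamp as a datetime, or None if it has none."""
    match = STAMP.match(line)
    if match is None:
        return None
    month, day, clock = match.groups()
    try:
        # Assumption: all lines belong to the given year (no New Year rollover).
        return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(minutes=10)):
    """Return (last_entry_before, first_entry_after) for each silent period."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_stamp(line, year)
        if current is None:
            continue
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


# Example usage (the year is an assumption for illustration):
#   with open("/var/log/syslog") as log:
#       for before, after in find_gaps(log, 2011):
#           print("silent from", before, "until", after)
```

The first timestamp of each pair is the last thing logged before the box went quiet, which is the moment I would want to line up against other logs or hardware events.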
If anyone can help shed any light or even give me anything else I can post here that might be helpful I would be very appreciative. I'm all for jumping in to learn by doing, but I'm starting to reach the end of my rope on this one.
Happy to post any specific log info or information that might be helpful in offering any suggestions.