
Here is the situation: I am running CentOS 5.7 x86_64 with Xen 3.0.3 (xen-3.0.3-132.el5_7.2.x86_64) and the Xen kernel (2.6.18-274.12.1.el5xen). The server has 8GB RAM and an i7-950 @ 3.07GHz. I am using it to host two guests - Windows Server 2008 R2 and CentOS 6.2 x86_64 - both fully virtualized and stored on LVM partitions.

For the last week or so, this server has been crashing 5-10 times a day, sometimes mere minutes after the last boot. Nothing about the machine has been changed, and no new software has been installed (I had been running this kernel/Xen version for about 3-4 weeks beforehand with no problems).

The machine is running perfectly, then just stops - nothing on the console, nothing (that I can see) in the logs. It has to be rebooted by powering down then up again, and sometimes it will happen again within just a few minutes. A full hardware check was run a little over a week ago, and everything came back clean. Using e2fsck did fix a couple of issues, but hasn't actually resolved the situation (if anything, it now seems to crash more regularly).

When I was booted into a live CD last night to run the e2fsck, it ran fine for ~8 hours without any crashes (which it probably wouldn't do under the CentOS install on the drive). It sounds more and more like a software issue, but it's difficult to pin down, seeing as no configurations have been changed, and no new software has been installed.

I've checked all the system logs, and nothing seems out of place. I've also done a full check of each partition using e2fsck. I've linked the pastebinned logs below, but I just can't figure it out.

/var/log/messages: http://pastebin.com/CNkf73sN

/var/log/dmesg: http://pastebin.com/r2Hx9uij

Any help on this would be much appreciated. Thanks in advance.

Josh
  • You have 2 network cards, a Broadcom and a Realtek one? Can you tell us more about the server? There is a known issue with Dell OpenManage drivers, Xen and the Broadcom bnxII driver. – Olivier S Feb 18 '12 at 12:37
  • Just an onboard Realtek NIC, using the drivers available on the CentOS installation media. I have read that it may be better to use the official Realtek driver (available from elrepo as kmod-r8168). I haven't looked into it much yet, as I only came across it a short while ago, and I'm not sure if the r8168 drivers will work with the r8169 my board has. What other information do you need about the server itself? – Josh Feb 18 '12 at 12:44
  • I would try editing the grub conf to add a dom0_mem parameter like this: kernel /xen.gz-3.3.1 dom0_mem=1024000 (or whatever value you want to reserve for dom0). I have already had issues with Xen when it tries to take memory from the dom0 to give it to a domU. – Olivier S Feb 18 '12 at 13:02
  • I'm using `dom0_mem=700M acpi=ht` on it already (added acpi due to crashes shortly after initial install, trying to run Windows HVM). – Josh Feb 18 '12 at 13:04
  • Why is your bridge named "br0"? Usually on CentOS the default bridge is "xenbr0". Also, when it was running for 8 hours on the live CD, did you start the HVM guests? Did you check if stopping one of the guests solved the problem? – Olivier S Feb 18 '12 at 13:23
  • I set up the bridge manually, after having some trouble with the automated process on a previous installation. I didn't start the guests from the live instance - it might be worth checking whether one of them causes the crash though. They were running for nearly a month without issue (and a guest shouldn't be able to bring down the host, because that would make Xen almost useless), but perhaps one of them is the cause. – Josh Feb 18 '12 at 13:26
  • Ok, so stopping the Windows guest solves the crashing problem... but surely a guest shouldn't be able to crash the Xen host? – Josh Feb 19 '12 at 20:28
  • Yes, this should not happen. Did you check that the Windows guest is not trying to use the VT extensions on the processor (running Hyper-V, for example)? Also, which drivers (I/O and network) do you use? The PV drivers give a huge performance boost, but I had some surprises on Windows XP; some releases were not stable. – Olivier S Feb 20 '12 at 19:26
  • Had the exact same problem. One of three identical servers would do the same thing, also with CentOS 5.7 64-bit, even after re-installs. Gave up and upgraded to CentOS 6, which fixed the problem. Moved the VMs back to it and they've been running perfectly. Not a solution, but a data point. – Chris Kaufmann May 16 '12 at 21:55
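
For readers following the dom0_mem discussion in the comments: on a CentOS 5 Xen host the option goes on the hypervisor (xen.gz) line of the GRUB stanza, not on the dom0 kernel line. The fragment below is only a sketch of what such a stanza might look like. The `dom0_mem=700M acpi=ht` values and the kernel version are taken from the question; the file names, `root (hd0,0)`, and the root device are assumptions that will differ per machine.

```
# /boot/grub/grub.conf (sketch; paths and root device are assumed, not from the question)
title CentOS (2.6.18-274.12.1.el5xen)
        root (hd0,0)
        # Hypervisor line: Xen options such as dom0_mem go here
        kernel /xen.gz-2.6.18-274.12.1.el5 dom0_mem=700M acpi=ht
        # dom0 kernel and initrd are loaded as modules
        module /vmlinuz-2.6.18-274.12.1.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-274.12.1.el5xen.img
```

Fixing dom0's memory this way is meant to stop Xen from taking memory from dom0 to give to a domU, which is the failure mode Olivier S describes; it is not confirmed to be the cause of the crashes in this thread.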
