3

We recently purchased a few Dell servers, all from the Rxxx series: a couple of R410s and R710s.

The OS on those servers is CentOS 5.4.

We're getting very strange error messages, and we have lost network connectivity a couple of times (restarting the network interface was needed to fix it).

The messages we're getting are:
Message from syslogd@ at Wed Nov 18 12:07:08 2009 ...
servername kernel: Uhhuh. NMI received for unknown reason 20.
Message from syslogd@ at Wed Nov 18 12:07:08 2009 ...
servername kernel: Do you have a strange power saving mode enabled?
Message from syslogd@ at Wed Nov 18 12:07:08 2009 ...
servername kernel: Dazed and confused, but trying to continue
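
To see how often these NMI messages occur and whether they line up with the network drops, the log can be searched for the message text. This is a minimal sketch: the function name `count_nmi_events` is hypothetical, and the log path is passed in as an argument because its location is an assumption (on CentOS, kernel messages usually end up in /var/log/messages).

```shell
#!/bin/sh
# Hypothetical helper: count the "NMI received for unknown reason" lines
# in the log file given as the first argument.
count_nmi_events() {
    log_file="$1"
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so "|| true" keeps the function's status at 0.
    grep -c -- 'NMI received for unknown reason' "$log_file" || true
}
```

For example, `count_nmi_events /var/log/messages` prints the number of such lines in the current log file.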

We never saw these messages on the previous series of Dell PowerEdge servers.

Is anyone here running CentOS 5.4 on the Rxxx series? Has this happened to you too?

Maybe you have a suggestion for how to prevent it from happening.


Update:

Thanks for the info.

I have already contacted Dell, of course; they even replaced the motherboard in two of our servers.

The fact that I've seen these strange OS messages on more than one server (one R410 and one R710) makes me think there may be a conflict between the OS and the hardware.

It just doesn't make sense that this would happen on more than one server, and even after a motherboard replacement.

Dell says they don't support CentOS. I ran their DSET diagnostics and sent them the results, and they didn't see anything wrong.

All firmware is up to date.

OrenM
  • I'm using CentOS 5.4 (kernel 2.6.18-164.6.1.el5 x86_64) on R300 without problem. – lg. Dec 03 '09 at 09:13

7 Answers

3

The solution was:

echo "options bnx2 disable_msi=1" >> /etc/modprobe.conf
/etc/init.d/network restart

I don't know whether Dell has fixed this in the latest firmware updates, but I'm adding this option on every Rxxx server that runs CentOS.
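
Because `>>` appends every time it runs, repeating the command leaves duplicate lines in /etc/modprobe.conf. A minimal sketch of adding the option only when it is missing follows; the function name `add_modprobe_option` is hypothetical, and the file path is taken as an argument so the function can be tried on a scratch file first.

```shell
#!/bin/sh
# Hypothetical helper: append an "options" line to a modprobe config file
# only when that exact line is not already present.
add_modprobe_option() {
    conf_file="$1"
    option_line="$2"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    # A missing file counts as "line not present" and is created by >>.
    if ! grep -qxF -- "$option_line" "$conf_file" 2>/dev/null; then
        printf '%s\n' "$option_line" >> "$conf_file"
    fi
}
```

Run as root, `add_modprobe_option /etc/modprobe.conf "options bnx2 disable_msi=1"` would add the line once, however many times it is called. Module options are read when the module is loaded, so the setting only applies after the bnx2 driver is next loaded, for example after a reboot.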

OrenM
3

Have a look at http://kbase.redhat.com/faq/docs/DOC-16294 for a possible solution.

There, the fix for hangs on RHEL 5.3 running the Xen kernel with the bnx2 driver is to edit /etc/modprobe.conf and add the line

options bnx2 "disable_msi=1"

user27966
1

This is definitely a hardware-related issue. Besides checking that the server's BIOS and BMC firmware are up to date, I'd contact Dell support and open a case.

They will probably say that CentOS is not a supported OS, but they do support RHEL5 if it was purchased as OEM, and if you can convince them that the kernel messages are hardware-related, the case will be escalated to software support.

To speed things up, ask them for the diagnostic tools they have for RHEL, run them, and send in the reports gathered.

dyasny
0

I have just gone through a bit of hell trying to figure this one out. After I replaced one R410 running CentOS 5.4 with another, the exact same problem occurred. The characteristics are:

  • after a period of time ranging from a day to 2 weeks, attempts to make TCP connections to services (incoming web & ssh) via the Broadcom network card fail with increasing frequency.
  • once problem begins, NIC drops packets
  • if left long enough, NIC can hang altogether
  • TCP connection attempts through lo do not exhibit any problems
  • active connections through NIC are not affected, only new connection attempts

Simply stopping and starting the NIC (ifdown/ifup) will reset it if hung, but a restart of the machine is needed for it to resume functioning without blocking connections or dropping packets.

Can anyone confirm that the flag 'options bnx2 "disable_msi=1"' resolves this problem? I'm reluctant to put either of these machines back in service without some assurance.
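
Whether the option cures the hangs can only be shown by running with it for a while, but it is at least possible to check that it took effect by looking at the interrupt type listed for the NIC in /proc/interrupts. A minimal sketch, assuming the interface is eth0 and that the type column contains the string "MSI" when MSI is in use (the exact wording of that column varies between kernel versions); the function name `nic_uses_msi` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the interrupt line for the given
# interface mentions MSI, fail otherwise.
# $1 = interface name, $2 = file to read (defaults to /proc/interrupts).
nic_uses_msi() {
    iface="$1"
    interrupts_file="${2:-/proc/interrupts}"
    # First grep keeps only the lines naming the interface as a whole word,
    # second grep tests whether any of them mention MSI.
    grep -w -- "$iface" "$interrupts_file" | grep -q 'MSI'
}
```

For example, `nic_uses_msi eth0 && echo "eth0 still uses MSI"` prints the message only while the NIC is still on MSI. If it still prints after the option was added, the bnx2 module has probably not been reloaded since the change.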

0

Have you installed all of the relevant Dell custom tools for that machine/OS combination? I think it's an IPMI issue where your machine is telling the OS something that it doesn't know how to deal with without the right drivers/tools installed.

Also try enabling or disabling HPET in your BIOS setup and/or grub.conf.

Chopper3
0

Does anyone have other information about this, or more ideas about what I should try to resolve it?

Thanks

OrenM
0

http://www.google.com/search?q=kernel:+Uhhuh.+NMI+received+for+unknown+reason+20.

Try the first result.

cagenut
  • It relates to an old kernel version, and since the patch is unofficial, I'd prefer to avoid it. – OrenM Dec 06 '09 at 08:46