5

HW: HP ProLiant ML350 G5, 22 GB RAM, 1 x Intel Xeon E5405 2.00 GHz CPU

OS: ESXi 5.5, just upgraded from 5.1 to try to fix the crashes that were already occurring under ESXi 5.1 on the same hardware.

I'm trying to find out why one of our servers is crashing; it has locked up twice in the last 24 hours. The internal error light on the front is blinking red. On the inside, only two LEDs are lit (items #5 and #6 on page 76 of the manual): the "Processor 2" light is amber and the "Power" light is green.

The only errors I can see in the logs for the relevant time frame are the ones below. Is this the cause? Or is there anything else I can do to log or locate the error?

from zcat syslog.6.gz | less

2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:53Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:57Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:01Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:04Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:15Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:23Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:27Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:31Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:46Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:48Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
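A long excerpt like this is easier to judge once it is condensed into counts. The following is a minimal Python sketch, not an ESXi tool: it assumes every line follows the `timestamp process[pid]: message` shape seen above, and the names `summarize` and `summarize_file` are made up for illustration.

```python
import gzip
import re
from collections import Counter

# Matches lines shaped like the excerpt above, e.g.
# 2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair ...
LINE_RE = re.compile(r"^(\S+)\s+([^\[\s]+)\[(\d+)\]:\s+(.*)$")


def summarize(lines):
    """Count how often each (process, message) pair occurs."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            _timestamp, process, _pid, message = match.groups()
            counts[(process, message)] += 1
    return counts


def summarize_file(path):
    """Summarize a gzip-compressed syslog file (hypothetical helper)."""
    with gzip.open(path, "rt", errors="replace") as handle:
        return summarize(handle)
```

Calling `summarize_file("syslog.6.gz").most_common(10)` on a copy of the rotated log would list the ten most frequent messages, which makes it quicker to see whether anything other than the sfcbd noise was logged before a lock-up.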

Update

Setting up iLO 2 and getting access to its logs showed some progress: I was getting lots of "Power removed" messages. So I started to suspect the power, and after removing the UPS the server has now been stable for 5 days.

Informational | iLO 2 | 05/29/2014 20:31 | 05/29/2014 20:31 | 1 | Server power restored.
Informational | iLO 2 | 05/29/2014 20:31 | 05/29/2014 20:31 | 1 | Server power removed.
Informational | iLO 2 | 05/29/2014 16:57 | 05/29/2014 16:57 | 1 | Server power restored.
Informational | iLO 2 | 05/29/2014 16:57 | 05/29/2014 16:57 | 1 | Server power removed.
Informational | iLO 2 | 05/29/2014 15:39 | 05/29/2014 15:39 | 1 | Server power restored.
Informational | iLO 2 | 05/29/2014 15:39 | 05/29/2014 15:39 | 1 | Server power removed.
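To line these power events up against the times the host locked up, it can help to pull out just the "power removed" timestamps and the gaps between them. This is a rough Python sketch under stated assumptions: the entries are typed in by hand as (timestamp, description) pairs, the timestamps use the MM/DD/YYYY HH:MM format shown in the log, and the function names are made up for illustration.

```python
from datetime import datetime

# Assumed timestamp format, taken from the iLO 2 entries above.
STAMP_FORMAT = "%m/%d/%Y %H:%M"


def power_loss_times(entries):
    """Return the sorted datetimes of all 'Server power removed.' entries."""
    times = []
    for stamp, description in entries:
        if description.strip().lower().startswith("server power removed"):
            times.append(datetime.strptime(stamp, STAMP_FORMAT))
    return sorted(times)


def gaps_between(times):
    """Return the time deltas between consecutive power-loss events."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


entries = [
    ("05/29/2014 20:31", "Server power restored."),
    ("05/29/2014 20:31", "Server power removed."),
    ("05/29/2014 16:57", "Server power restored."),
    ("05/29/2014 16:57", "Server power removed."),
    ("05/29/2014 15:39", "Server power restored."),
    ("05/29/2014 15:39", "Server power removed."),
]
losses = power_loss_times(entries)
gaps = gaps_between(losses)
```

Irregular gaps, as in this sample, would argue against a scheduled task and fit better with an intermittent power or hardware fault, although the log alone cannot prove either.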

Update 2

Still not stable: it has crashed again, 2 times in the last 24 hours.

The same entries appear in the logs:

Informational | iLO 2 | 06/13/2014 05:21 | 06/13/2014 05:21 | 2 | Server power removed.
Informational | iLO 2 | 06/13/2014 05:21 | 06/13/2014 05:21 | 3 | Server power restored.

The iLO interface stays up after this happens. The IML log is empty and does not show anything.



UPDATE 3

Status Summary
Server Name: esx01.xx.xx; ProLiant ML350 G5
UUID: 32393534-3937-5A43-4A38-353130393248
Server Serial Number / Product ID: CZJ851092H / 459279-425
System ROM: D21 11/02/2008; backup system ROM: 11/02/2008
System Health: Ok
Internal Health LED: Ok
Server Power: ON
UID Light: OFF
Last Used Remote Console: Remote Console
Latest IML Entry: IML Cleared (iLO 2 user:xxx)
iLO 2 Name: ILOCZJ851092H
License Type: iLO 2 Standard
iLO 2 Firmware Version: 1.61 08/31/2008
IP address: 192.168.2.2
Active Sessions: iLO 2 user:xxx
Latest iLO 2 Event Log Entry: Browser login: xxx - 172.20.1.105(DNS name not found).
iLO 2 Date/Time: 06/13/2014 23:22:52
Darkmage

1 Answer

7

You likely have a hardware problem. This is not an issue with VMware ESXi.

  • Which build number of ESXi are you on?
  • What firmware revision is the server hardware/BIOS on?
  • Is the other ESXi host you mentioned comprised of the same hardware?

Your best bet is to examine the HP Integrated Management Log (IML) of the server. You can do this through the iLO 2 interface.

  • Log on to the iLO and check the hardware system status tab. That main summary screen will probably tell you what's wrong.
  • Additionally, take a look at the IML option under the "System Status" tab. This will tell you why the server crashed.

That's all. You may have a RAM, CPU or system board issue here.



Edit: Update your host's firmware, please!! - Don't become a statistic!

The download for the current bootable firmware DVD for your system is here. Please boot your system with that and let it update all of the components. Everything on that server looks like it dates back to 2008. That's a BIG no-no when working with HP server hardware.

ewwhite
  • Actually it could well be a driver issue. I had similar lockups after updating from 5.1 to 5.5; it had not updated the drivers correctly. – JamesRyan May 27 '14 at 13:39
  • It's not a driver issue. – ewwhite May 27 '14 at 13:46
  • What do you know that isn't listed in this question? Because the drivers can, and in my case did, cause false CPU/memory errors to be reported. – JamesRyan May 27 '14 at 14:13
  • @JamesRyan I know HP ProLiant servers and VMware. I also understand the state of the internal health LEDs on this particular hardware. The crashes persisted through the ESXi host upgrade. The IML logs should be examined, as stated in my answer. – ewwhite May 27 '14 at 14:20
  • This is a good place to start, but to blindly rule out all other possibilities is foolish. I have given a direct example of when both the LEDs and the health log lied. – JamesRyan May 27 '14 at 14:33
  • @JamesRyan The HP IML log won't *"lie"*. The internal health LED on this server isn't triggered by software, and there's nothing buggy about the driver sets unique to ProLiant hardware under ESXi. This is straight Broadcom tg3, Intel chipset and HP CCISS array componentry. Those are solid drivers with a deep install base. Errors within them wouldn't cause what the OP is seeing. – ewwhite May 27 '14 at 14:41
  • Ah OK, you couldn't possibly be mistaken; I must have been dreaming when I saw this happen in front of my eyes. – JamesRyan May 27 '14 at 14:46
  • `o/~ You know my logs don't lie And I'm starting to feel it's right o/~` – Tom O'Connor May 27 '14 at 14:48
  • When you move a hard drive from one server to another and the "CPU error" moves with it, you can be pretty sure the log is lying and the CPUs didn't magically switch places. :) – JamesRyan May 27 '14 at 14:49