ESXi v5.5 is having random crashes

Question

HW: HP ProLiant ML350 G5, 22 GB RAM, CPU 1 x Intel Xeon E5405 2.00GHz

OS: ESXi 5.5, just upgraded from 5.1 to try to fix the crashes that were already occurring with ESXi 5.1 on the same hardware.

I'm trying to find out why one of our servers is crashing; it has locked up twice in the last 24 hours. The internal error light on the front is blinking red. On the inside, the only lights that are lit (#5 and #6 on page 76 of the manual) are the "Processor 2" light (amber) and the "Power" light (green).

The only errors I can see in the logs for the relevant time frame are shown below. Are they the cause? Or is there anything else I can do to log or locate the error?

from zcat syslog.6.gz | less

2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:55:47Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:53Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:57Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:01Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:04Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:15Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:56:17Z sfcbd[35064]: Failed to set timeout for local socket (e.g. provider)
2014-05-26T11:56:17Z sfcbd[35064]: spGetMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcbd[35064]: rcvMsg receiving from -1 35064-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:17Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:23Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:27Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:31Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:34Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:44Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:46Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:56:48Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
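
Because the excerpt above repeats a handful of messages many times, it can help to collapse it into counts per process and message before deciding whether it is related to the lock-ups. The following is only a minimal sketch, assuming the log has been copied off the host and that every line follows the `timestamp process[pid]: message` shape seen above; the function name `summarize` and the regular expression are illustrative, not part of any ESXi tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
#   2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair ...
LINE_RE = re.compile(r"^(\S+Z) ([\w-]+)\[(\d+)\]: (.*)$")


def summarize(lines):
    """Count how often each (process, message) pair occurs."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            _timestamp, process, _pid, message = match.groups()
            counts[(process, message)] += 1
    return counts


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
2014-05-26T11:55:47Z sfcbd[35064]: Error opening socket pair for getProviderContext: Too many open files
2014-05-26T11:55:47Z sfcbd[35064]: Failed to set recv timeout (30) for socket -1. Errno = 9
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
2014-05-26T11:55:47Z sfcb-ProviderManager[34828]: SendMsg sending to 1 34828-9 Bad file descriptor
""".splitlines()

if __name__ == "__main__":
    # Most frequent messages first.
    for (process, message), count in summarize(SAMPLE).most_common():
        print(f"{count:4d}  {process}: {message}")
```

To run it against the real file, the lines could be read with `gzip.open("syslog.6.gz", "rt")` instead of `SAMPLE`. The counts only show which messages dominate; they do not by themselves show that sfcbd is the cause of the crashes.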

Update

Setting up iLO 2 and getting access to its logs showed some progress: I was getting lots of "Server power removed" messages. So I started to suspect the power, and after removing the UPS the server has now been stable for 5 days.

Informational | iLO 2 | 05/29/2014 20:31 | 05/29/2014 20:31 | 1 | Server power restored.
Informational | iLO 2 | 05/29/2014 20:31 | 05/29/2014 20:31 | 1 | Server power removed.
Informational | iLO 2 | 05/29/2014 16:57 | 05/29/2014 16:57 | 1 | Server power restored.
Informational | iLO 2 | 05/29/2014 16:57 | 05/29/2014 16:57 | 1 | Server power removed.
Informational | iLO 2 | 05/29/2014 15:39 | 05/29/2014 15:39 | 1 | Server power restored.
Informational | iLO 2 | 05/29/2014 15:39 | 05/29/2014 15:39 | 1 | Server power removed.
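
To check whether the power losses follow a pattern (for example a fixed interval, which might point at a scheduled UPS self-test), the timestamps from the iLO 2 event log above can be compared. This is a rough sketch, assuming the dates are in MM/DD/YYYY format (consistent with day 29 appearing in the second position); the helper names `power_loss_times` and `gaps_between` are made up for illustration.

```python
from datetime import datetime

# (timestamp, description) pairs copied from the iLO 2 event log above.
EVENTS = [
    ("05/29/2014 20:31", "Server power restored."),
    ("05/29/2014 20:31", "Server power removed."),
    ("05/29/2014 16:57", "Server power restored."),
    ("05/29/2014 16:57", "Server power removed."),
    ("05/29/2014 15:39", "Server power restored."),
    ("05/29/2014 15:39", "Server power removed."),
]

# Assumption: iLO 2 prints dates as month/day/year with a 24-hour clock.
TIME_FORMAT = "%m/%d/%Y %H:%M"


def power_loss_times(events):
    """Return the sorted timestamps of every 'Server power removed.' entry."""
    return sorted(
        datetime.strptime(stamp, TIME_FORMAT)
        for stamp, description in events
        if description == "Server power removed."
    )


def gaps_between(times):
    """Return the time elapsed between each consecutive pair of timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    losses = power_loss_times(EVENTS)
    for moment, gap in zip(losses[1:], gaps_between(losses)):
        print(f"{moment:%Y-%m-%d %H:%M}  ({gap} after the previous loss)")
```

With only three events the gaps are not conclusive either way; the point is just to make irregular versus regular spacing easy to see as more entries accumulate.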

Update 2

Still not stable: it has crashed again twice in the last 24 hours.

The iLO log shows the same kind of entries:

Informational | iLO 2 | 06/13/2014 05:21 | 06/13/2014 05:21 | 2 | Server power removed.
Informational | iLO 2 | 06/13/2014 05:21 | 06/13/2014 05:21 | 3 | Server power restored.

The iLO interface stays up after this happens. The IML is empty and does not show anything.


UPDATE 3

Status Summary
Server Name: esx01.xx.xx; ProLiant ML350 G5
UUID: 32393534-3937-5A43-4A38-353130393248
Server Serial Number / Product ID: CZJ851092H / 459279-425
System ROM: D21 11/02/2008; backup system ROM: 11/02/2008
System Health: Ok
Internal Health LED: Ok
Server Power: ON
UID Light: OFF
Last Used Remote Console: Remote Console
Latest IML Entry: IML Cleared (iLO 2 user:xxx)
iLO 2 Name: ILOCZJ851092H
License Type: iLO 2 Standard
iLO 2 Firmware Version: 1.61 08/31/2008
IP address: 192.168.2.2
Active Sessions: iLO 2 user:xxx
Latest iLO 2 Event Log Entry: Browser login: xxx - 172.20.1.105(DNS name not found).
iLO 2 Date/Time: 06/13/2014 23:22:52

At least it's not the ever-present cleaning lady pulling the server's plug to get at the socket for her vacuum cleaner. — the-wabbit, Jun 10 '14 at 09:01
Just as a sidenote: When my server starts crashing *suddenly*, the last thing I'd do is upgrade the OS "just in case" ;-) — Marki, Jun 13 '14 at 09:31
@Darkmage Your system may be experiencing a real crash. The ASR on the ProLiant may be kicking in. Please look at the IML "Integrated Management Log" in the ILO interface. It's right under the ILO log. Also, can you tell me the firmware/BIOS revision of the server? — ewwhite, Jun 13 '14 at 11:29
@ewwhite The IML log is empty; nothing is being logged there. I'll get the firmware and BIOS version, as the server is off site. — Darkmage, Jun 13 '14 at 12:11
@Darkmage Just look in ILO Summary tab to obtain the BIOS information. — ewwhite, Jun 13 '14 at 12:15
@Darkmage Please see my update in my answer below. You desperately need to update the firmware on this server. — ewwhite, Jun 13 '14 at 21:34
It has been stable for 5 days now; I am getting ready to say it's healthy again. — Darkmage, Jun 18 '14 at 13:57

score 7 · Accepted Answer · edited Mar 17 '17 at 10:13

You likely have a hardware problem. This is not an issue with VMware ESXi.

Which build number of ESXi are you on?
What firmware revision is the server hardware/BIOS on?
Is the other ESXi host you mentioned comprised of the same hardware?

Your best bet is to examine the HP Integrated Management Log (IML) of the server. You can do this through the ILO 2 interface.

Log on to the ILO and check the hardware system status tab. That main summary screen will probably tell you what's wrong.
Additionally, take a look at the IML option under the "System Status" tab. This will tell you why the server crashed.

That's all. You may have a RAM, CPU or system board issue here.

Edit: Update your host's firmware, please!! - Don't become a statistic!

The download for the current bootable firmware DVD for your system is here. Please boot your system with that and let it update all of the components. Everything on that server looks like it dates back to 2008. That's a BIG no-no when working with HP server hardware.

answered May 27 '14 at 11:59 by ewwhite; edited Mar 17 '17 at 10:13 by Community

Actually, it could well be a driver issue. I had similar lockups after upgrading from 5.1 to 5.5; the upgrade had not updated the drivers correctly. – JamesRyan May 27 '14 at 13:39
It's not a driver issue. – ewwhite May 27 '14 at 13:46
What do you know that isn't listed in this question? Because the drivers can, and in my case did, cause false CPU/memory errors to be reported. – JamesRyan May 27 '14 at 14:13
@JamesRyan I know HP ProLiant servers and VMware. I also understand the state of the internal health LEDs on this particular hardware. The crashes persisted through the ESXi host upgrade. The IML logs should be examined, as stated in my answer. – ewwhite May 27 '14 at 14:20
This is a good place to start, but blindly ruling out all other possibilities is foolish. I have given a direct example of when both the LEDs and the health log lied. – JamesRyan May 27 '14 at 14:33
@JamesRyan The HP IML log won't *"lie"*. The internal health LED on this server isn't triggered by software, and there's nothing buggy about the driver sets unique to ProLiant hardware under ESXi - This is straight Broadcom tg3, Intel chipset and HP CCISS array componentry. Those are solid drivers with a deep install base. Errors within wouldn't cause what the OP is seeing. – ewwhite May 27 '14 at 14:41
ah ok you couldn't possibly be mistaken, I must have been dreaming when I saw this happen in front of my eyes. – JamesRyan May 27 '14 at 14:46
`o/~ You know my logs don't lie And I'm starting to feel it's right o/~` – Tom O'Connor May 27 '14 at 14:48
When you move a hard drive from one server to another and the 'cpu error' moves with it, you can be pretty sure the log is lying and the CPUs didn't magically switch places. :) – JamesRyan May 27 '14 at 14:49
