
I have a Dell PE R710 with a PERC 6/i running ESXi 5.1. It has two datastores, one of which is a pair of SSDs in a RAID 1. This morning I got a call that something wasn't working. Initially, I logged into the vSphere client and saw that the virtual machines were not responsive. I tried stopping all the virtual machines, but nothing happened. I tried to browse the datastore, but no folders/files appeared. After reading some KB articles, I ran two commands: /etc/init.d/hostd restart and /etc/init.d/vpxa restart
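In case it matters, here is roughly how I would check from the ESXi shell whether the host still sees the device and the VMFS volume (assuming SSH/shell access is enabled; these are generic commands, not exactly what I ran):

    # List VMFS volumes and whether they are currently mounted
    esxcli storage filesystem list

    # List the storage devices the host can see (the RAID 1 virtual disk should show up here)
    esxcli storage core device list

    # Recent VMkernel messages can also hint at a dead device or path
    tail -n 100 /var/log/vmkernel.log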

After that, the datastore still did not appear in vSphere. When I got in front of the server, the following was on its LCD panel: E1810 Hard Drive Fault. So it appears that a drive has gone bad. On a Windows server, I would typically just hot-swap the drive, but with this being VMware, I'm unsure of the proper procedure. I would greatly appreciate any help!

IT_wrench

2 Answers


If the data is inaccessible, you've probably lost your array due to a multi-disk failure or something similar. This is what happens when hardware monitoring isn't implemented and you lose too many RAID members before you're aware of it. It can also happen during a more general hardware failure, such as a flaky controller card.

In these situations, you're typically only alerted when your service goes down and your array integrity is damaged to the point that you need to restore from backup.

The process for swapping drives is exactly the same in your case as it is with Windows, Linux, or any other OS on that box; your hardware RAID card is handling everything. However, hot-swapping may not do you any good here, since your entire array is probably damaged rather than just degraded. Assess the condition of your array before doing anything, either with software tools such as MegaCLI or Dell OpenManage, or by rebooting into the controller's BIOS utility and checking your array there. Also check your iDRAC for hardware logs that may indicate failures.
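If MegaCLI is available on the host (see the comments below for getting it onto ESXi 5.x), a quick, read-only status check might look roughly like this; the install path and adapter numbering are assumptions, so adjust them for your setup:

    # Virtual drive (array) state: Optimal, Degraded, or Offline
    /opt/lsi/MegaCLI/MegaCli -LDInfo -Lall -aALL

    # Per-physical-drive state: Online, Failed, Missing, Foreign, etc.
    /opt/lsi/MegaCLI/MegaCli -PDList -aALL

    # Controller event log, useful for seeing when a drive dropped out
    /opt/lsi/MegaCLI/MegaCli -AdpEventLog -GetEvents -f /tmp/events.log -aALL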

You're likely going to have to restore from backup here, because you'll probably find that both of your SSDs have gone bad, or that your controller or backplane has gone bad (or all of the above). It would be best to restore your data to another node and take this one out of production until you can determine whether this is a multi-disk, controller, or backplane failure.

Spooler
  • Thank you for your advice. How would I go about installing megacli on a vmware host and assessing the condition of the array? – IT_wrench Oct 21 '16 at 20:43
  • This checks out: http://de.community.dell.com/techcenter/support-services/w/wiki/909.how-to-install-megacli-on-esxi-5-x – Spooler Oct 21 '16 at 20:48
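For reference, the install described in that link boils down to something like the following on ESXi 5.x; the VIB filename/version shown here is only an example, not an exact value:

    # Copy the MegaCLI VIB to the host first (filename/version is an example)
    esxcli software vib install -v /tmp/vmware-esx-MegaCli-8.07.07.vib --no-sig-check

    # The binary typically ends up under /opt/lsi/MegaCLI/
    /opt/lsi/MegaCLI/MegaCli -AdpAllInfo -aALL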

In the end, I booted into the RAID controller configuration and examined the physical drive. It was marked as "Missing" and, worse yet, was configured for RAID 0. I powered off the server and reseated the drive. Upon booting the server back up, the RAID controller indicated a foreign configuration, but I did not import it. Once ESXi booted, the SSD datastore still wasn't recognized, so I powered the server down again. I booted it up and this time imported the foreign configuration into the RAID controller. ESXi booted up and recognized the SSD datastore! I immediately pulled off all the data.
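For anyone else hitting this: the same foreign-configuration handling can reportedly also be done from MegaCLI without rebooting into the controller BIOS, and the data can be copied off over SSH once the datastore mounts again. This is only a sketch, not exactly what I did; the adapter number, datastore name, and destination host are placeholders:

    # Check whether the controller sees a foreign configuration on adapter 0
    /opt/lsi/MegaCLI/MegaCli -CfgForeign -Scan -a0

    # Import the foreign configuration so the array comes back online
    /opt/lsi/MegaCLI/MegaCli -CfgForeign -Import -a0

    # Once ESXi re-mounts the VMFS volume, copy the VM folders off immediately
    scp -r /vmfs/volumes/ssd-datastore/* admin@backup-host:/restore/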

IT_wrench
  • Do I understand correctly that someone erased the controller configuration? These things don't change spontaneously, since they are stored in NVRAM with admin-only access, you know? – Martin Sugioarto Nov 03 '16 at 19:45