I have an odd situation here. I have a Dell R620 with the PERC H310 Mini controller. There are two RAID arrays: a 2-disk mirror for the OS and a 6-disk RAID 10 for the data drive. When a single disk fails in the RAID 10, my data becomes unavailable and the volume is listed as invalid in Windows Disk Management. Is this normal behavior? I thought a single disk failure would simply put the array into a degraded mode until a new disk is added, but instead I completely lose my volume. On a side note, I have had three disk failures in the last week. I don't think the issues are related, but I could be wrong. Thanks for any assistance.
2 Answers
You're absolutely correct that a single disk failure in a RAID-10 array should not result in the volume becoming unavailable. Something is likely wrong with your PERC controller.
You should get into Dell OpenManage Server Administrator (OMSA) or the iDRAC and see if any information is reported there. You can also check the Windows Event Logs (if OMSA is installed and configured to write events to those logs).
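For example, if the OMSA command-line tools are installed, something like the following run from an elevated prompt should show the controller, virtual disk, and physical disk status, plus the hardware logs. This is a rough sketch; it assumes the H310 is controller 0, so adjust the controller ID to match the output of the first command:

```
# List controllers, their status, and current firmware version
omreport storage controller

# Virtual disk state (look for Degraded/Failed on the RAID 10)
omreport storage vdisk controller=0

# Physical disk state (look for Failed, Predictive Failure, or Foreign)
omreport storage pdisk controller=0

# Hardware/ESM log and OMSA alert log
omreport system esmlog
omreport system alertlog
```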
If there are available firmware and/or driver updates for your controller or backplane, consider installing them. I would recommend doing this while the array is healthy, if at all possible.
It's also possible you have multiple failed disks. Depending on which disks in a RAID-10 set fail, you can lose more than one (up to three in your six-disk array, one from each mirror pair) without the array going offline; however, losing just two disks that happen to be in the same RAID-1 pair will take the whole array down.
Don't forget to contact Dell Support if your system is still in warranty. They are very good at helping diagnose issues like this.

- I'm starting to think it is the controller as well. Tomorrow I am going to schedule some downtime, pull the server out, make sure everything is connected nice and tight, and then do some firmware updates. It just sucks because this is a production server that is handling updates for over 1k users. Ugh – Fr0ntSight Apr 06 '17 at 23:08
- @Fr0ntSight Any luck? – I say Reinstate Monica Apr 08 '17 at 19:26
Three disk failures in a week isn't inconceivable, especially if the disks were all put into service at the same time and have similar amounts of wear. However, I would begin to suspect the controller or backplane if this were happening to me.
Can you get into your iDRAC and see if there are any failures in the logs regarding the RAID controller?
Also, are you running SMART checks on the disks that have failed, and on the currently running members? That would reveal whether the disks themselves are bad, and would give you a clue as to how they might be failing if they are. The smartctl utility is part of the smartmontools suite, which is available to install and use in a Windows environment. Refer to the man page for how to access drives behind your RAID controller, specifically the -d option.
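As a rough example of what that looks like in practice (the drive name and RAID target ID below are placeholders; `--scan` will show the device names and `-d` types smartctl thinks are appropriate on your system):

```
# Discover how smartctl sees the drives behind the controller
smartctl --scan

# Read SMART attributes for one member drive behind a MegaRAID-family
# controller such as the H310; "0" is the drive's target ID, /dev/sda
# (or /dev/pd0 on Windows) is the controller's device node
smartctl -a -d megaraid,0 /dev/sda

# Kick off a short self-test on the same drive, then check results later
smartctl -t short -d megaraid,0 /dev/sda
smartctl -l selftest -d megaraid,0 /dev/sda
```

Look for reallocated or pending sector counts and failed self-tests. If several drives look clean in SMART but keep getting flagged by the controller, that points back at the controller or backplane rather than the disks.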

- With just a single drive failure right now, shouldn't I still be able to access the Windows volume? Instead, Disk Management just says invalid. It is a dynamic volume. – Fr0ntSight Apr 06 '17 at 20:43
- In a way, yes. But that's too simplistic a view. We're not putting disks into a RAID, but blocks that happen to be on disks. RAID controllers make an effort to "patrol read" to detect inconsistent block sets; however, if too much of any given set is too inconsistent, the controller won't be able to properly detect or correct the inconsistency. This mechanism also makes no attempt to ensure filesystem-level consistency. You almost certainly have a damaged volume, either because the array as a whole is inconsistent (not necessarily detectable by the controller) or because you have some sort of controller problem. – Spooler Apr 06 '17 at 21:30
- The controller is running a "Background Initialization" right now, 20% complete. I really hope that once it's finished I can get this volume back. I thought RAID 10 was supposed to be redundant: lose one disk, no problem, throw in a new one and you are good to go, and even with the one disk failed or removed I would think it would still operate in a degraded mode. – Fr0ntSight Apr 06 '17 at 22:51
- Only if the volume is healthy and you don't have invalid blocks in places other than the disk you lost. A RAID 10 can actually lose enough disks to reduce it to a single stripe, typically meaning a maximum of two in a four-member array, but that stripe must be fully intact for the array to remain valid. I would suggest reading the SMART data from your disks and looking at the logs rather than making assumptions, since you have access to the whole thing. – Spooler Apr 06 '17 at 23:57