How to recover from raid 5 failure of 2 disks with tw_cli?

Question

I have a hardware RAID 5 array of 12 disks. 2 of them died and the data is no longer accessible. I was told that even though 2 disks died, some of the data might be recoverable. My hosting provider replaced the bad disks with new ones (at first they replaced a functioning disk with a new one, but now everything is in place).

I'm using tw_cli and I guess that I now need to "rebuild" the array, but I'm afraid of making mistakes. I couldn't find any step-by-step guide for this case with tw_cli.

Can you please advise what should be done now and what the exact tw_cli commands are?

# tw_cli /c0/u0 show

Unit     UnitType  Status         %Cmpl  Port  Stripe  Size(GB)  Blocks
-----------------------------------------------------------------------
u0       RAID-5    INOPERABLE     -      -     256K    20489     42968510464 
u0-0     DISK      DEGRADED       -      -     -       1862.63   3906228224  
u0-1     DISK      OK             -      p1    -       1862.63   3906228224  
u0-2     DISK      OK             -      p2    -       1862.63   3906228224  
u0-3     DISK      OK             -      p3    -       1862.63   3906228224  
u0-4     DISK      OK             -      p4    -       1862.63   3906228224  
u0-5     DISK      OK             -      p5    -       1862.63   3906228224  
u0-6     DISK      OK             -      p6    -       1862.63   3906228224  
u0-7     DISK      OK             -      p7    -       1862.63   3906228224  
u0-8     DISK      OK             -      p8    -       1862.63   3906228224  
u0-9     DISK      OK             -      p9    -       1862.63   3906228224  
u0-10    DISK      OK             -      p10   -       1862.63   3906228224  
u0-11    DISK      DEGRADED       -      -     -       1862.63   3906228224

OS: CentOS

UPDATE: As @Overmind suggested, I had the original disks inserted again. It said rebuilding, and now it says inoperable, but 11 disks out of 12 are OK!

I replaced the bad disk (p0) with a new one and tried to rebuild, but it failed because the unit is busy. Any idea what I should do?

tw_cli /c0/u0 start rebuild disk=0
Sending rebuild start request to /c0/u0 on 1 disk(s) [0] ... Failed.

(0x0B:0x0033): Unit busy

I tried to unmount the folder on this RAID array but it didn't help. I read in the manual that I should mark the disk as a spare, so I did, but I'm afraid I got bad results. I really need your help here.

tw_cli /c0 add type=spare disk=0
Creating new unit on controller /c0 ...  Done. The new unit is /c0/u1.

# tw_cli /c0 show

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-5    INOPERABLE     -      256K    20489     OFF    ON       OFF      
u1    SPARE     OK             -      -       1863.01   -      OFF      -        

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u1     1.82 TB     3907029168    9WM0XF4D      
p1     OK               u0     1.82 TB     3907029168    53SB7TLAS     
p2     OK               u0     1.82 TB     3907029168    53SDBSXAS     
p3     OK               u0     1.82 TB     3907029168    53SB7UJAS     
p4     OK               u0     1.82 TB     3907029168    53SB7SGAS     
p5     OK               u0     1.82 TB     3907029168    53SB8BPAS     
p6     OK               u0     1.82 TB     3907029168    53VDW0PGS     
p7     OK               u0     1.82 TB     3907029168    53SDAHTAS     
p8     OK               u0     1.82 TB     3907029168    53SB7U3AS     
p9     OK               u0     1.82 TB     3907029168    53SB7UBAS     
p10    OK               u0     1.82 TB     3907029168    53VE7D5AS     
p11    OK               u0     1.82 TB     3907029168    43N2SNDGS     

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK        OK       OK       0      xx-xxx-xxxx

Thank you, we're always looking for examples of why nobody should be using RAID 5 in 2015 - this will hopefully help many others in the future from emulating your mistake. — Chopper3, Jan 28 '15 at 12:10
There's only two options these days R1/10 or R6 - we get so many people on here asking why but just do a search here, you're not the first one to be in this position but we're doing our best to educate people so that maybe you're the last. — Chopper3, Jan 28 '15 at 12:19

score 2 · Answer 1 · answered Jan 28 '15 at 14:13

3Ware controllers are nice - no doubt about that. But as noted above, RAID 5 with many disks is a real problem. If the disks are completely dead and gone, I would say you have no way of recovering, short of using a data recovery tool like this:

https://www.runtime.org/raid.htm

I have tried recovering data for customers (long time ago) and it is at best ridiculously time consuming. Even with the proper tools, with two disks gone, some data is irrecoverably lost. If just one of the two disks can be somewhat recovered, you might be in luck. That would allow reconstruction and as far as I recall, the 3Ware stuff is reasonably good at it.

All things considered, I hate to agree with the previous posters, but with two disks gone (and with that good disk having been replaced too), I would say your chances are pretty slim.

Given the relatively low disk prices these days (not including SSDs), go for at least RAID 6 with a hot spare next time. The best option is RAID 10 with hot spare(s) as it gives you (up to) 50% failure tolerance and great speed on top.
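To put rough numbers on that trade-off, here is a minimal sketch of the textbook capacity arithmetic for the 12 disks of 1862.63 GB shown in the question's tw_cli output. The usable_gb function is a hypothetical helper, not part of tw_cli, and it assumes equal-size disks with no hot spare and no controller overhead.

```shell
#!/bin/sh
# usable_gb LEVEL DISKS SIZE_GB
# Hypothetical helper: usable capacity under the textbook formulas.
# Assumes equal-size disks; hot spares and controller overhead are ignored.
usable_gb() {
    lvl=$1 n=$2 gb=$3
    case "$lvl" in
        5)  data=$((n - 1)) ;;   # one disk's worth of parity
        6)  data=$((n - 2)) ;;   # two disks' worth of parity
        10) data=$((n / 2)) ;;   # every disk has a mirror
        *)  echo "unsupported level: $lvl" >&2; return 1 ;;
    esac
    awk -v d="$data" -v s="$gb" 'BEGIN { printf "%.2f\n", d * s }'
}

for level in 5 6 10; do
    printf 'RAID %-2s usable: %s GB\n' "$level" "$(usable_gb "$level" 12 1862.63)"
done
```

The RAID 5 result rounds to the 20489 GB that the controller reports for u0. RAID 6 gives up one more disk of capacity in exchange for surviving any two failures, while RAID 10 halves the capacity and survives a second failure only if it does not hit the mirror partner of the first.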

Overmind · Answer 2 · 2015-01-28T12:04:17.360

1

Did they fail at the exact same time? What do you mean by "disks died"? Are they mechanically broken or do they only have some corruption on them?

Anyway, you have a double disk failure on RAID 5. This means your data is gone. The array cannot be rebuilt.

With that many disks it would have made sense to use RAID 6, which protects against 2 disk failures at the same time.

The only way you could have saved the array was to replace the first failed disk and rebuild the array before the second failure.

If one is still relatively functional, you could re-insert it into the RAID and try a rebuild from there (/c0/u0 start rebuild disk=p) and, if successful, replace it afterwards and run a second rebuild.

If the original drives are not mechanically broken, put them back (both) and run /c0 u1 remove, /c0 u11 remove and then /c0 rescan. That could re-add at least one of them to the RAID if it is alive enough.

Note that the c0/u0/p notations depend on the CLI version and system configuration.
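The remove, rescan, rebuild order can be sketched as a dry run that only prints the commands so they can be checked against the CLI help before anything is executed. The plan_recovery function is a hypothetical helper, and the port-level form /cX/pY remove is an assumption that may not match your CLI version. Ports 0 and 11 are taken from the degraded subunits in the question's output.

```shell
#!/bin/sh
# plan_recovery CONTROLLER UNIT PORT...
# Hypothetical helper: prints the tw_cli commands for the
# remove -> rescan -> rebuild order. Nothing is executed.
plan_recovery() {
    ctl=$1 unit=$2
    shift 2
    for port in "$@"; do
        # Assumed port-level syntax; verify with your CLI version's help.
        echo "tw_cli /c${ctl}/p${port} remove"
    done
    echo "tw_cli /c${ctl} rescan"
    # Rebuild onto the first listed port, as in "start rebuild disk=p".
    echo "tw_cli /c${ctl}/u${unit} start rebuild disk=$1"
}

# Controller 0, unit 0, ports 0 and 11.
plan_recovery 0 0 0 11
```

Printing instead of executing is deliberate: with an array in this state, each command should be reviewed by hand and run one at a time.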

edited Jan 28 '15 at 12:04

answered Jan 28 '15 at 11:57


I don't know if they failed at the same time because I didn't monitor the RAID status, and how could I know if one has failed? I guess that if one failed, the RAID kept operating and I couldn't know about it. The disks were in READ-TIMEOUT status, does that mean they are mechanically broken? – Niros Jan 28 '15 at 12:03
Mechanically broken means totally inaccessible (as in fried with smoke, or making clicking sounds when powering up). See the above edit. Try to add the originals back and rebuild with them. Maybe one is alive enough to save your RAID. – Overmind Jan 28 '15 at 12:06
I don't have physical access to the HDDs since they belong to the hosting provider, I will tell them to put the disks back and I'll try to rebuild as you said. What is the exact command I need to run after they insert the "broken" disks? Is it tw_cli /c0/u0 start rebuild disk=p? tw_cli show ver: CLI Version = 2.00.03.013 API Version = 2.00.00.087 – Niros Jan 28 '15 at 12:09
The logical order of commands is to remove the 2 broken ones from the array (use the remove command for u1 and u11), then rescan (at this point they will be re-detected) and finally rebuild. For detailed info on parameters, check the CLI help, because the command structure is too version-dependent. – Overmind Jan 28 '15 at 12:14
They inserted the old HDDs and now it says status: rebuilding! I didn't execute any command and it changed to rebuilding without any manual intervention. Should I just wait or should I execute the "start rebuild" command? – Niros Jan 28 '15 at 14:01

@Niros, if the rebuild actually completes, you have totally dodged a big bullet. Take the server out of service. Don't touch **anything** on it. Wait for that rebuild to complete. If it does, back up all the data on that RAID array, replace the once-failed discs, recreate it RAID0+1, and restore the data. **Then** find out how to monitor that particular hardware RAID so that you get more warning of a disc failure than complete array failure. Then review your backup strategy, which is unfit for purpose. Then go and buy a stack of lottery tickets, for it is clearly your *very* lucky day. – MadHatter Jan 28 '15 at 14:23
LOL thanks, I've just updated the main post with the current status. It's not rebuilding anymore. Waiting for your great instructions on next steps. Will wait with the lottery tickets for now :) – Niros Jan 28 '15 at 14:25
11 of 12 OK means your data is saved. The critical part is gone. Lucky you :D – Overmind Jan 30 '15 at 10:06
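On the monitoring point raised in the comments, one possible approach is to scan the tw_cli /c0 show listing for anything that is not OK. The sketch below assumes the column layout shown in the question (status in the third column for units, second column for ports). The check_raid function is a hypothetical helper, and it treats every status other than OK as a problem, which may be too strict for states such as a running verify.

```shell
#!/bin/sh
# check_raid: reads `tw_cli /cX show` output on stdin, prints every unit or
# port whose status is not OK, and returns non-zero if it found any.
# Assumes the column layout from the question's output.
check_raid() {
    awk '
        $1 ~ /^u[0-9]+$/ && $3 != "OK" { print "unit " $1 ": " $3; bad = 1 }
        $1 ~ /^p[0-9]+$/ && $2 != "OK" { print "port " $1 ": " $2; bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Demo on three lines copied from the question's `tw_cli /c0 show` output.
check_raid <<'EOF' || echo "array needs attention"
u0    RAID-5    INOPERABLE     -      256K    20489     OFF    ON       OFF
u1    SPARE     OK             -      -       1863.01   -      OFF      -
p0     OK               u1     1.82 TB     3907029168    9WM0XF4D
EOF
```

On a live system the input would come from tw_cli /c0 show | check_raid, run periodically (for example from cron) with an alert on a non-zero exit status, so that a single degraded disk is noticed before a second one fails.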
