Ok, I'll try to keep this quick. The data on these drives isn't mission-critical, so there's no backup. Losing the data would be a bit annoying, so if I could get it back that would be neat, but if not that's fine. More than anything, this seems like a good time to explore some mdadm wizardry.
I have a RAID array that, when it was working, looked like this:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[4] sda1[2] sdd1[5] sdb1[3]
      2929731072 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/8 pages [0KB], 65536KB chunk
But one of the drives failed (sdc1[4]). Then, during the rebuild, another drive failed (sdd1[5]). The Classic. But I was a bit suspicious of this second failure; it may have just been a power blip or something. I figured that if I could assemble the array with the "failed" sdd1[5] in read-only mode, I could maybe still get some of the data off it.
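What I had in mind was roughly the following (I believe --readonly can be passed at assemble time, but I haven't actually run this exact incantation, so treat it as a sketch):

# stop whatever half-assembled md0 is lying around, then force-assemble
# read-only from the three drives that still respond, including the suspect sdd1
mdadm --stop /dev/md0
mdadm --assemble --force --readonly /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdd1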
Now it looked like:
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[4](S) sda1[2] sdd1[5](F) sdb1[3]
      2929731072 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [_UU_]
      bitmap: 8/8 pages [32KB], 65536KB chunk
unused devices: <none>
Ok, so I wanted to ignore the failure and re-add sdd1[5], but the re-add brought it back as a spare instead... That's not good.
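For the record, the re-add was roughly the following (reconstructing from memory, so the exact invocation may have differed slightly):

mdadm /dev/md0 --remove /dev/sdd1   # it was marked (F)
mdadm /dev/md0 --re-add /dev/sdd1   # ...and it came back as a spare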
When I --examine all the disks, they all report the same event count, but two of them show up as Active device 1 and Active device 2, and the other shows up as spare...
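By "examine" I mean something like this, comparing the Events and Device Role lines across the three surviving disks:

for d in /dev/sda1 /dev/sdb1 /dev/sdd1; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Events|Device Role'
done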
I tried an --assemble --force, but that just put me back into the same state. What I want is some way to tell the drive it isn't a spare, but I'm not sure such a tool exists. So I figure the last thing I can try is recreating the array with --create and --assume-clean, to see if I can squeeze the last bit of data out of this. But that feels destructive if I get it wrong, and I probably only have one shot at it, so I'm looking for someone who knows more than I do.
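One idea I've read about for making this less of a one-shot: put device-mapper snapshot overlays on top of the partitions, so anything --create writes lands in scratch files instead of on the real disks. The sketch below is roughly what I had in mind -- untested by me, and the overlay size and /tmp paths are just placeholders:

# sparse scratch file + loop device per disk, then a dm snapshot whose
# copy-on-write store is the scratch file; writes go to the overlay and the
# underlying partition is left untouched
for d in sda1 sdb1 sdd1; do
    truncate -s 4G /tmp/overlay-$d
    loop=$(losetup -f --show /tmp/overlay-$d)
    size=$(blockdev --getsz /dev/$d)
    dmsetup create overlay-$d --table "0 $size snapshot /dev/$d $loop N 8"
done
# ...and then experiment against /dev/mapper/overlay-* instead of the real partitions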
So my first question is (A): is there any hope that this could possibly work, or am I just misguided?
After that comes (B): is there something else I could try that's a bit less drastic and has a better chance of working?
And finally (C): assuming this is my best shot, what order do I give the disks to --create in? In /proc/mdstat they were listed in sdc1[4] sda1[2] sdd1[5] sdb1[3] order, but the state when the array died was _UU_, and it was sdc1[4] and sdd1[5] that failed... In the --examine output, sda1 and sdb1 were listed as Active device 1 and Active device 2, which lines up with the _UU_ thing if the positions are zero-based, but I don't know why they'd be in that order... So if I do run a create, how do I know what order to put the disks in, and where do I put the missing one? I assume I can only mess that up once, so I'd like to take my best shot if I can.
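For concreteness, the command I have in mind is something like this, with the parameters copied from the old mdstat line and the device order being exactly the part I don't know (it could also be pointed at the overlay devices above instead of the raw partitions):

# algorithm 2 should be left-symmetric, which is the RAID5 default anyway
mdadm --create /dev/md0 --assume-clean \
      --level=5 --raid-devices=4 --metadata=1.2 \
      --chunk=512 --layout=left-symmetric \
      missing /dev/sda1 /dev/sdb1 /dev/sdd1   # <-- this order is the question

The overlay idea above is partly so I could try an ordering without touching the real superblocks.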
Thanks for reading!