
My RAID array has failed, and I'm not sure what the best steps are to try to recover it.

I've got 4 drives in a RAID5 configuration. It seems as if one drive (sde1) has failed, but md can't bring the array up because it says sdd1 is not fresh.

Is there anything I can do to recover the array?

I've pasted below some excerpts from /var/log/messages and mdadm --examine:

/var/log/messages

$ egrep -w 'sd[bcde]|raid|md' /var/log/messages

nas kernel: [...] sd 5:0:0:0: [sde]  
nas kernel: [...] sd 5:0:0:0: [sde] CDB: 
nas kernel: [...] end_request: I/O error, dev sde, sector 937821218
nas kernel: [...] sd 5:0:0:0: [sde] killing request
nas kernel: [...] md/raid:md0: read error not correctable (sector 937821184 on sde1).
nas kernel: [...] md/raid:md0: Disk failure on sde1, disabling device.
nas kernel: [...] md/raid:md0: Operation continuing on 2 devices. 
nas kernel: [...] md/raid:md0: read error not correctable (sector 937821256 on sde1).
nas kernel: [...] sd 5:0:0:0: [sde] Unhandled error code 
nas kernel: [...] sd 5:0:0:0: [sde]  
nas kernel: [...] sd 5:0:0:0: [sde] CDB: 
nas kernel: [...] end_request: I/O error, dev sde, sector 937820194
nas kernel: [...] sd 5:0:0:0: [sde] Synchronizing SCSI cache 
nas kernel: [...] sd 5:0:0:0: [sde]  
nas kernel: [...] sd 5:0:0:0: [sde] Stopping disk
nas kernel: [...] sd 5:0:0:0: [sde] START_STOP FAILED
nas kernel: [...] sd 5:0:0:0: [sde]  
nas kernel: [...] md: unbind<sde1>
nas kernel: [...] md: export_rdev(sde1)
nas kernel: [...] md: bind<sdd1>
nas kernel: [...] md: bind<sdc1>
nas kernel: [...] md: bind<sdb1>
nas kernel: [...] md: bind<sde1>
nas kernel: [...] md: kicking non-fresh sde1 from array!
nas kernel: [...] md: unbind<sde1>
nas kernel: [...] md: export_rdev(sde1)
nas kernel: [...] md: kicking non-fresh sdd1 from array!
nas kernel: [...] md: unbind<sdd1>
nas kernel: [...] md: export_rdev(sdd1)
nas kernel: [...] md: raid6 personality registered for level 6
nas kernel: [...] md: raid5 personality registered for level 5
nas kernel: [...] md: raid4 personality registered for level 4
nas kernel: [...] md/raid:md0: device sdb1 operational as raid disk 2
nas kernel: [...] md/raid:md0: device sdc1 operational as raid disk 0
nas kernel: [...] md/raid:md0: allocated 4338kB
nas kernel: [...] md/raid:md0: not enough operational devices (2/4 failed)
nas kernel: [...] md/raid:md0: failed to run raid set.
nas kernel: [...] md: pers->run() failed ...

mdadm --examine

$ mdadm --examine /dev/sd[bcdefghijklmn]1

/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 4dc53f9d:f0c55279:a9cb9592:a59607c9
           Name : NAS:0
  Creation Time : Sun Sep 11 02:37:59 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907027053 (1863.02 GiB 2000.40 GB)
     Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : e8369dbc:bf591efa:f0ccc359:9d164ec8

    Update Time : Tue May 27 18:54:37 2014
       Checksum : a17a88c0 - correct
         Events : 1026050

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : A.A. ('A' == active, '.' == missing)
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 4dc53f9d:f0c55279:a9cb9592:a59607c9
           Name : NAS:0
  Creation Time : Sun Sep 11 02:37:59 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907027053 (1863.02 GiB 2000.40 GB)
     Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 78221e11:02acc1c8:c4eb01bf:f0852cbe

    Update Time : Tue May 27 18:54:37 2014
       Checksum : 1fbb54b8 - correct
         Events : 1026050

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : A.A. ('A' == active, '.' == missing)
/dev/sdd1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 4dc53f9d:f0c55279:a9cb9592:a59607c9
           Name : NAS:0
  Creation Time : Sun Sep 11 02:37:59 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907027053 (1863.02 GiB 2000.40 GB)
     Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : fd282483:d2647838:f6b9897e:c216616c

    Update Time : Mon Oct  7 19:21:22 2013
       Checksum : 6df566b8 - correct
         Events : 32621

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sde1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 4dc53f9d:f0c55279:a9cb9592:a59607c9
           Name : NAS:0
  Creation Time : Sun Sep 11 02:37:59 2011
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 3907027053 (1863.02 GiB 2000.40 GB)
     Array Size : 5860538880 (5589.05 GiB 6001.19 GB)
  Used Dev Size : 3907025920 (1863.02 GiB 2000.40 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : e84657dd:0882a7c8:5918b191:2fc3da02

    Update Time : Tue May 27 18:46:12 2014
       Checksum : 33ab6fe - correct
         Events : 1026039

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAA. ('A' == active, '.' == missing)
Steve Lorimer
  • I've had some joy with some software from Runtime (no affil.), to recover data from failed disks. Granted, I've only used their GetDataBack product to recover a non-RAID partition, but they also have RAID Reconstructor and NAS Data Recovery, the latter of which looks a little more Linux-friendly: http://www.runtime.org/raid.htm / http://www.runtime.org/nas-recovery.htm – jimbobmcgee May 28 '14 at 11:02
  • 1
    RAID5 becomes completely unsuitable with increasing drive size. A rebuild requires reading the contents of (in your case) two 2TB drives without errors to complete a successful rebuild. With a consumer grade drive's standard URE rate of 1 in 10^14 bits it is simple math to calculate that the probability of at least a single read failure over that much data is enormous. This looks like a small NAS with, I assume, consumer grade drives. Selecting RAID5 for this application is an error as it offers essentially no effective redundancy. http://tinyurl.com/2dc3amz – J... May 28 '14 at 11:39
  • Just to do the math, you have 2TB * 8 bits/byte * 3 drives = 4.8E13 bits of storage to read for rebuild. With an unrecoverable read error rate of 1 in 10^14 bits, that makes a 48% probability of a RAID5 rebuild failure if the drives are near capacity. Further, if one drive has failed and they were bought together (and in service together in the same conditions) then the probability of the companions starting to fail soon thereafter is even higher. For large drives, you need to consider RAID6 - this can tolerate a complete drive failure *plus* any number of UREs during rebuild. – J... May 28 '14 at 11:44
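
For reference, the arithmetic in the comment above can be sanity-checked with a quick one-liner (a rough sketch: the 1-in-10^14 figure is the typical consumer-drive URE spec, and the 48% number is the expected count of UREs; the exact probability of hitting at least one comes out a little lower, though still far too high to gamble on):

awk 'BEGIN {
    bits = 3 * 2e12 * 8          # three 2TB drives read in full during a rebuild
    rate = 1e-14                 # typical consumer-drive URE rate: 1 error per 1e14 bits read
    printf "expected UREs during rebuild: %.2f\n", bits * rate
    printf "P(at least one URE): about %.0f%%\n", (1 - exp(bits * log(1 - rate))) * 100
}'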

2 Answers


You've had a double drive failure: sde1 has just failed, and sdd1 has been out of the array since October 2013. With RAID5, this is irrecoverable. Replace the failed hardware and restore from backup.

Going forward, consider RAID6 with large drives like this and make sure you have monitoring in place to catch device failures so you can respond to them ASAP.
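
A minimal sketch of what that monitoring could look like using mdadm itself (the mail address is a placeholder and assumes a working local MTA; the config path varies by distribution, and many distributions already ship an mdadm/mdmonitor service that does this for you):

# in /etc/mdadm/mdadm.conf (or /etc/mdadm.conf):
MAILADDR admin@example.com

# or run the monitor directly:
mdadm --monitor --scan --daemonise --mail admin@example.com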

EEAA
  • Well fine then, if you're gonna edit in all the useful details from my answer then I'll just delete it ;) +1 – Shane Madden May 27 '14 at 23:25
  • @ShaneMadden - Hah...those edits were in progress before I saw yours. I smiled when I saw the similarities, though. – EEAA May 27 '14 at 23:26
  • Haha, for sure. – Shane Madden May 27 '14 at 23:28
  • Thanks @EEAA. `smartctl` says `/dev/sdb`, `/dev/sdc` and `/dev/sdd` are all fine (`SMART overall-health self-assessment test result: PASSED`). `/dev/sde` is `FAILED`: `Attribute=Seek_Error_Rate, Type=Pre-fail`. Is it true that 2 drives have failed, or only one, and is there indeed **nothing** that can be done to recover? – Steve Lorimer May 28 '14 at 00:42
  • @SteveLorimer - `smartctl` has no knowledge of the filesystem layer, and is really to be used as a general guide. You've had two devices fail in a RAID5. That's bad. Rebuild with new hardware and restore from backup. That's going to be your fastest way to recover. – EEAA May 28 '14 at 01:07
  • @EEAA Ok, thanks. So you know what I'm going to say next right? No backup! >.< My bad, I know, and I will learn my lesson! In the meantime, is there *anything* I can do to recover *some* data? – Steve Lorimer May 28 '14 at 01:26
  • @SteveLorimer - send your drives off to a professional [data recovery service](http://www.krollontrack.com/). They'll likely be able to get all or a good portion of your data back, but be prepared, it's going to cost you dearly. – EEAA May 28 '14 at 01:28
  • I realise I may be clutching at straws, but it would be remiss not to ask the question, so please forgive me - is there any middle ground? Something I can do to reassemble the raid myself in order to recover some data? – Steve Lorimer May 28 '14 at 01:30
  • @SteveLorimer You can try `mdadm --assemble` with your device names, but beware, doing so may do more harm to whatever is left of your data, making it even more difficult for a data recovery company to recover. – EEAA May 28 '14 at 01:32
  • @SteveLorimer - You're welcome. Get your backups in place and tested! Not only for this device, but for all of your systems. – EEAA May 28 '14 at 01:34

Well, if your backups are not current, you could try a forced reassembly in degraded mode using three drives:

mdadm -v --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sde1

As sde1 is only slightly out of sync with respect to Update Time and Event count, I suspect you will be able to access most of your data. I have done this successfully many times in similar RAID5 failure scenarios. Compare the superblock timestamps and event counts:

  • sdb1: Update Time Tue May 27 18:54:37 2014, Events 1026050
  • sdc1: Update Time Tue May 27 18:54:37 2014, Events 1026050
  • sdd1: Update Time Mon Oct  7 19:21:22 2013, Events 32621
  • sde1: Update Time Tue May 27 18:46:12 2014, Events 1026039
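
If the forced assembly does bring the array up, a cautious next step (a sketch only: /mnt/recovery is a placeholder, and this assumes a filesystem sits directly on /dev/md0 rather than LVM or anything else layered on top) is to confirm the array state and mount it read-only before copying data off:

cat /proc/mdstat
mdadm --detail /dev/md0

mkdir -p /mnt/recovery
mount -o ro /dev/md0 /mnt/recovery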
S.Haran
  • `/dev/sde1` has failed – Steve Lorimer Jun 15 '14 at 01:22
  • "Failed" as reported by smartctl is often not a total failure. If you can still run `mdadm --examine /dev/sde1`, then a reassembly should be possible. In many similar cases that I have worked on, recovering 99% of the data has been possible. Of course, it all depends on the extent of the damage to sde1. – S.Haran Jun 15 '14 at 13:58