
I'm trying to recover data from a RAID5 array. 2 of my 4 disks unexpectedly failed at the same time. I am able to start the array by forcing it.

mdadm --assemble --scan --force

The array starts up clean, but degraded:

root@omv:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Apr 18 22:03:46 2012
     Raid Level : raid5
     Array Size : 8790795264 (8383.56 GiB 9001.77 GB)
  Used Dev Size : 2930265088 (2794.52 GiB 3000.59 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Mon Aug 25 23:50:44 2014
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : omv:data  (local to host omv)
           UUID : 157604ce:9206dd99:c8d249be
         Events : 21524

    Number   Major   Minor   RaidDevice State
       4       8       16        0      active sync   /dev/sdb
       1       0        0        1      removed
       2       8       32        2      active sync   /dev/sdc
       3       8       48        3      active sync   /dev/sdd

I proceed to mount the file system in read-only mode, but the read errors eventually result in the device being dropped from the array (kernel log below). Is there a way I can force it not to be dropped? I'd like to be able to copy off what I can.
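The mount itself is nothing special, just a plain read-only mount along these lines (the mount point is only an example):

mount -o ro /dev/md0 /mnt/recovery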

[  190.250032] end_request: I/O error, dev sdc, sector 1234525616
[  190.250082] raid5:md0: read error not correctable (sector 1234525616 on sdc).
[  190.250086] raid5: Disk failure on sdc, disabling device.
[  190.250088] raid5: Operation continuing on 2 devices.
[  190.250195] ata5: EH complete
[  190.366679] Buffer I/O error on device md0, logical block 462946358
[  190.366723] lost page write due to I/O error on md0
[  192.873263] ata5.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x0
[  192.873308] ata5.00: irq_stat 0x40000008
[  192.873348] ata5.00: failed command: READ FPDMA QUEUED
[  192.873392] ata5.00: cmd 60/10:00:00:dc:3c/00:00:57:00:00/40 tag 0 ncq 8192 in
[  192.873394]          res 41/40:10:00:dc:3c/00:00:57:00:00/00 Emask 0x409 (media error) <F>
[  192.873476] ata5.00: status: { DRDY ERR }
[  192.873514] ata5.00: error: { UNC }
Marinus

2 Answers


You should take images of all the RAID member drives with a tool like dd_rescue, and then assemble a RAID volume with these images.

This way you don't put any extra stress on the failing hard disks, and you have the best chance of recovering the data.
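A rough sketch of that workflow (the image paths, the /dev/md1 name and the mount point are only examples, and I'm assuming the images fit on a separate healthy disk):

# Copy each surviving member onto the healthy disk; dd_rescue keeps going past unreadable sectors
dd_rescue /dev/sdb /mnt/backup/sdb.img
dd_rescue /dev/sdc /mnt/backup/sdc.img
dd_rescue /dev/sdd /mnt/backup/sdd.img

# Expose the images as block devices
losetup --find --show /mnt/backup/sdb.img    # prints e.g. /dev/loop0
losetup --find --show /mnt/backup/sdc.img
losetup --find --show /mnt/backup/sdd.img

# Assemble a degraded array from the copies and mount it read-only
mdadm --assemble --force /dev/md1 /dev/loop0 /dev/loop1 /dev/loop2
mount -o ro /dev/md1 /mnt/recovery

Since these are copies, it doesn't matter if the forced assembly has to rewrite the superblocks, and the failing disks are never touched again.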

Tero Kilkanen
  • This strategy will resolve the issue, provided you can duplicate the data. The question I had was whether you could force the RAID driver not to disable the disk. – Marinus Aug 27 '14 at 12:09
  • It doesn't appear that it's possible to force the array not to disable the device. Recovering the disks and starting the array appears to be the only viable option, as Tero Kilkanen mentioned. – Marinus Sep 09 '14 at 08:47

The problem is that the ext2/3/4 filesystem doesn't know anything about the underlying device. If there is a bad block on one of the RAID member devices, it causes two different, independent results:

  1. the RAID subsystem will disable the entire disk and put the array into degraded mode;
  2. the filesystem will report a read error (in the case of a mirror this is not always so).

I hold an opinion that is considered mostly heretical in "professional system administrator" circles. In this view,

  1. a disk with some bad blocks is not the work of the devil,
  2. if you have 45634563563456 good sectors on an HDD, you may well be able to handle the 3 bad ones in between.

The problem is that the underlying RAID mechanism doesn't know anything about the bad blocks on the disk either: it will try to sync even the bad blocks.
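If you want to see what you are actually dealing with first, you can enumerate the bad areas yourself (the device name is an example, and keep in mind that even a read-only scan puts load on a failing disk):

badblocks -sv /dev/sdc    # non-destructive read-only scan, prints the unreadable block numbers
smartctl -A /dev/sdc      # check Reallocated_Sector_Ct and Current_Pending_Sector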

If you use the disk in a RAID, the simplest solution is to get a new hard disk. You can still use the failing one for other tasks, just not as a RAID member. The ext2/3/4 filesystems have very good bad block handling.
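For example, when you reuse such a disk as a standalone ext4 volume, you can let the filesystem map out the bad blocks itself (the partition name is only an example):

mkfs.ext4 -c /dev/sde1    # scan for bad blocks while creating the filesystem
e2fsck -c /dev/sde1       # or add newly found bad blocks to an existing filesystem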

If you want to keep using it as a RAID member device, that is possible, although not so simple. In that case you need some tricky device-mapper/LVM based solution; some of them can handle and manage volumes even on a disk with bad blocks. I suggest you try the bad block relocation module of the device mapper.
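I don't have the bbr module's table syntax at hand, so here is the same idea sketched with the standard linear and error targets instead: a mapped device that fences off a known-bad region. All sector numbers and names below are made-up examples.

blockdev --getsz /dev/sdc    # total device size in 512-byte sectors

cat > remap.table <<'EOF'
0          1234525568 linear /dev/sdc 0
1234525568 128        error
1234525696 4625995200 linear /dev/sdc 1234525696
EOF
# The three lengths must add up exactly to the size reported by blockdev.
dmsetup create sdc_remap < remap.table

/dev/mapper/sdc_remap then returns immediate I/O errors for the fenced region instead of hammering the bad sectors; unlike the bbr module, it does not relocate anything.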

peterh
  • How do you use the bad block relocation module? – Marinus Aug 26 '14 at 08:44
  • @Marinus I don't use it, because I don't use RAID any more; instead I use filesystem-based cluster solutions. Back when I did, I used the dmsetup tool for all such tasks. – peterh Aug 26 '14 at 08:49