
Prelude

I had the following devices in my /dev/md0 RAID 6: /dev/sd[abcdef]

The following drives were also present, unrelated to the RAID: /dev/sd[gh]

The following drives were part of a card reader that was connected, again, unrelated: /dev/sd[ijkl]

Analysis

sdf's SATA cable went bad (you could say it was unplugged while in use), and sdf was subsequently rejected from the /dev/md0 array. I replaced the cable and the drive came back, now at /dev/sdm. Please do not challenge my diagnosis; there is nothing wrong with the drive itself.
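For anyone retracing this, two quick ways to confirm that the returned disk really is the old array member (a sketch; the grep patterns are illustrative, adjust for your device names):

# the md superblock on the returned disk still carries the array's UUID
mdadm --examine /dev/sdm | grep -i uuid

# udev's persistent names tie the new /dev/sdm node to the old drive's serial
ls -l /dev/disk/by-id/ | grep sdm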

mdadm --detail /dev/md0 showed sdf(F), i.e., that sdf was faulty. So I used mdadm --manage /dev/md0 --remove faulty to remove the faulty drive.
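For reference, that removal as a plain command (mdadm's manual documents this keyword as failed; faulty is what I typed, and it did the job):

# drop every device currently marked failed from md0
mdadm --manage /dev/md0 --remove faulty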

Now mdadm --detail /dev/md0 showed "removed" in the space where sdf used to be.

root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 21:16:14 2015
          State : active, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 67205

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       4       0        0        4      removed
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb

For some reason, the RaidDevice number of the "removed" slot now matches that of an active device. Anyway, let's try adding the previous device (now known as /dev/sdm) back, since that was the original intent:

root@galaxy:~# mdadm --add /dev/md0 /dev/sdm
mdadm: added /dev/sdm
root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 21:19:30 2015
          State : active, degraded
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 67623

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       4       0        0        4      removed
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb

       6       8      192        -      spare   /dev/sdm

As you can see, the device shows up as a spare and never starts syncing with the rest of the array:

root@galaxy:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdm[6](S) sdb[5] sda[0] sde[4] sdd[3] sdc[1]
      15627548672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
      bitmap: 17/30 pages [68KB], 65536KB chunk

unused devices: <none>

I have also tried using mdadm --zero-superblock /dev/sdm before adding, with the same result.
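Spelled out, that retry looked roughly like this (a sketch; the intermediate --remove is implied, since the inert spare has to come back out before its superblock can be zeroed):

mdadm --manage /dev/md0 --remove /dev/sdm   # pull the spare back out
mdadm --zero-superblock /dev/sdm            # wipe the old md metadata
mdadm --add /dev/md0 /dev/sdm               # re-add as a fresh device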

The reason I am using RAID 6 is to provide high availability. I will not accept stopping /dev/md0 and re-assembling it with --assume-clean or similar workarounds. This needs to be resolved online; otherwise I don't see the point of using mdadm.


1 Answer


After hours of Googling and some extremely wise help from JyZyXEL in the #linux-raid Freenode channel, we have a solution! There was not a single interruption to the RAID array during this process - exactly what I needed and expected from mdadm.

For some (currently unknown) reason, the array's sync state had become frozen. The winning command for figuring this out is cat /sys/block/md0/md/sync_action:

root@galaxy:~# cat /sys/block/md0/md/sync_action
frozen
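While you are in that sysfs directory, a couple of neighbouring read-outs are worth knowing about (standard md attributes, nothing specific to this array):

cat /sys/block/md0/md/degraded        # count of missing members; non-zero here
cat /sys/block/md0/md/sync_completed  # progress of the current sync, or "none"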

Simply put, that is why the array was not using the available spare. To think all that hair-pulling could have been spared by a single cat command!

So, just unfreeze the array:

root@galaxy:~# echo idle > /sys/block/md0/md/sync_action

And you're away!

root@galaxy:~# cat /sys/block/md0/md/sync_action
recover
root@galaxy:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdm[6] sdb[5] sda[0] sde[4] sdd[3] sdc[1]
      15627548672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
      [>....................]  recovery =  0.0% (129664/3906887168) finish=4016.8min speed=16208K/sec
      bitmap: 17/30 pages [68KB], 65536KB chunk

unused devices: <none>
root@galaxy:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jul 30 13:17:25 2014
     Raid Level : raid6
     Array Size : 15627548672 (14903.59 GiB 16002.61 GB)
  Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
   Raid Devices : 6
  Total Devices : 6
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Mar 17 22:05:30 2015
          State : active, degraded, recovering
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 0% complete

           Name : eclipse:0
           UUID : cc7dac66:f6ac1117:ca755769:0e59d5c5
         Events : 73562

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       32        1      active sync   /dev/sdc
       6       8      192        2      spare rebuilding   /dev/sdm
       3       8       48        3      active sync   /dev/sdd
       4       8       64        4      active sync   /dev/sde
       5       8       16        5      active sync   /dev/sdb

Bliss :-)
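For completeness, two standard ways to keep an eye on the rebuild from here (nothing array-specific, just the usual tooling):

watch -n 60 cat /proc/mdstat   # refresh the recovery progress every minute
mdadm --wait /dev/md0          # or block until the recovery completes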

  • Thanks for that. Deserved to be bookmarked :-) I'm not having this kind of problem in particular but something similar. I've bought a pair of disks and I'm trying to add them to an existing RAID6 array with two faulty disks. No data loss at this time! :-) ... One of the disks was added OK but the other one is reported as faulty and automagically removed from the array. S.M.A.R.T. does not report anything wrong with the brand new disk... so, I'm still trying to figure out why the disk is refused. I'm doing full tests with the new disk in order to stress it and see if SMART reports anything. – Richard Gomes Nov 24 '19 at 11:07