
I have a 5 x 3TB raidz1 array on an Ubuntu 14.04.1 server. Last month, one of the drives died (audible clicking). I was able to replace it with `zpool replace RAID <dead drive> <new drive>`. That finished without issue and the pool was online and healthy again. Then another drive died. I attempted the same replacement, but the pool is now stuck in the following state:
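For reference, the stuck replace was of this form (device IDs reconstructed from the status output below, so treat the exact arguments as an assumption; ZFS also accepts the guid of the dead disk as the old-device argument):

# zpool replace RAID ata-ST3000DM001-9YN166_Z1F15TBH ata-ST3000DM001-1ER166_W500JFME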

# zpool status
  pool: RAID
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 29.1G in 6h3m with 1028 errors on Mon Jan  5 05:35:35 2015
config:

NAME                                   STATE     READ WRITE CKSUM
RAID                                   DEGRADED     0     0 1.00K
  raidz1-0                             DEGRADED     0     0 2.01K
    ata-ST3000DM001-9YN166_Z1F15FAV    ONLINE       0     0     0
    ata-ST3000DM001-9YN166_Z1F15FCJ    ONLINE       0     0     0
    replacing-2                        DEGRADED     0     0     4
      17164957131155215254             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3000DM001-9YN166_Z1F15TBH-part1
      ata-ST3000DM001-1ER166_W500JFME  ONLINE       0     0     0
    ata-ST3000DM001-1ER166_Z500765Z    ONLINE       0     0     3
    ata-ST3000DM001-1CH166_W1F1M2C6    ONLINE       0     0     0

errors: 1028 data errors, use '-v' for a list

The good news is the data is non-essential. I am not worried about the errors (the files are videos and still play fine). I have tried the following actions to remedy this, as suggested by other questions and forums.

# zpool offline RAID ata-ST3000DM001-9YN166_Z1F15TBH
cannot offline ata-ST3000DM001-9YN166_Z1F15TBH: no valid replicas

# zpool offline RAID 17164957131155215254
cannot offline 17164957131155215254: no valid replicas

# zpool detach RAID ata-ST3000DM001-9YN166_Z1F15TBH
cannot detach ata-ST3000DM001-9YN166_Z1F15TBH: no valid replicas

# zpool detach RAID 17164957131155215254
cannot detach 17164957131155215254: no valid replicas

I have also run `zpool clear RAID` and `zpool scrub RAID`, which triggered resilvers but left the pool in the same status as above. I then tried to offline the new disk, but oddly got the same no valid replicas error:

# zpool offline RAID ata-ST3000DM001-1ER166_W500JFME
cannot offline ata-ST3000DM001-1ER166_W500JFME: no valid replicas

I am at a loss as to how to proceed. It appears the replace itself was successful, but ZFS won't let go of the original disk.

# dkms status -v
spl, 0.6.3, 3.13.0-43-generic, x86_64: installed
zfs, 0.6.3, 3.13.0-43-generic, x86_64: installed

Update: I removed the zpool cache at /etc/zfs/zpool.cache and rebooted. Resilvering again, will report back.
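Concretely, that amounted to the following (moving the cache aside instead of deleting it would have been the safer variant):

# rm /etc/zfs/zpool.cache
# reboot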

Update 2: Still in the same status as above. If there is no way to finish the replace, is there any way to rebuild the pool without losing any data?

Update 3: Here is the most recent status:

# zpool status
  pool: RAID
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 29.1G in 6h1m with 1028 errors on Wed Jan  7 03:49:13 2015
config:

    NAME                                   STATE     READ WRITE CKSUM
    RAID                                   DEGRADED     0     0 1.00K
      raidz1-0                             DEGRADED     0     0 2.01K
        ata-ST3000DM001-9YN166_Z1F15FAV    ONLINE       0     0     0
        ata-ST3000DM001-9YN166_Z1F15FCJ    ONLINE       0     0     1
        replacing-2                        DEGRADED     0     0     0
          17164957131155215254             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3000DM001-9YN166_Z1F15TBH-part1
          ata-ST3000DM001-1ER166_W500JFME  ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500765Z    ONLINE       0     0     0
        ata-ST3000DM001-1CH166_W1F1M2C6    ONLINE       0     0     0

errors: 1028 data errors, use '-v' for a list

The smartctl data for all 5 drives is here.
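For anyone reproducing the collection: the reports were gathered per drive with smartmontools, along these lines (the by-id name is one of the five from the status output above; repeat for each drive):

# smartctl -a /dev/disk/by-id/ata-ST3000DM001-9YN166_Z1F15FAV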

  • Very wild guess. Have you tried `zpool detach RAID replacing-2`, or `zpool offline RAID replacing-2`? Or wilder yet: `zpool replace RAID replacing-2 ata-ST3000DM001-1ER166_W500JFME`. **Please keep in mind, I haven't tried myself, so you're definitely risking the data.** – Fox Mar 16 '15 at 16:06
  • No luck with those. `zpool detach RAID replacing-2` returns `cannot detach replacing-2: no such device in pool`; `zpool offline RAID replacing-2` returns `cannot offline replacing-2: no such device in pool`; and `zpool replace RAID replacing-2 ata-ST3000DM001-1ER166_W500JFME` returns `invalid vdev specification use '-f' to override the following errors: /dev/disk/by-id/ata-ST3000DM001-1ER166_W500JFME-part1 is part of active pool 'RAID'`. – KernelSanders Mar 17 '15 at 01:43

2 Answers


Please try

zpool offline
zpool detach

e.g. for the pool in the question:

# zpool offline RAID 17164957131155215254
# zpool detach RAID 17164957131155215254
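If the detach goes through, the replacing-2 group should collapse and only the new disk should remain. Untested against your exact pool, but verify with:

# zpool status RAID

and, once the layout looks sane, reset the stale error counters with `zpool clear RAID`.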

  • `detach` worked for me in a similar situation. But in my case I had a cascaded `spare-1, replacing-0, spare-0` where both of my hot spares ended up in the mix. I did `offline` and `detach` of the `UNAVAIL` disk. One of my spares also ended up detached; I easily added it back. – jerlich Aug 14 '19 at 04:13

I had the exact same status:

    NAME                                              STATE     READ WRITE CKSUM
    RAIDZ0_01                                         DEGRADED     0     0     0
      raidz1-0                                        DEGRADED     0     0     0
        gptid/4fb5f83e-91b1-11e2-923c-000c292ee274    ONLINE       0     0     0
        gptid/50402028-91b1-11e2-923c-000c292ee274    ONLINE       0     0     0
        replacing-2                                   DEGRADED     0     0     0
          2345526077585836973                         UNAVAIL      0     0     0  was /dev/gptid/72973ce8-f3bf-11e2-9759-000c292ee274
          gptid/19062bb3-c67f-11e4-8683-000c292ee274  ONLINE       0     0     0
        gptid/d69abb6b-3cd2-11e4-873f-000c292ee274    ONLINE       0     0     0
        gptid/51e62469-91b1-11e2-923c-000c292ee274    ONLINE       0     0     0
        gptid/528221a4-91b1-11e2-923c-000c292ee274    ONLINE       0     0     0
        gptid/53288697-91b1-11e2-923c-000c292ee274    ONLINE       0     0    36
        gptid/c8d9e708-cc4a-11e3-99b3-000c292ee274    ONLINE       0     0     0
    logs
      gptid/ade4947f-e365-11e3-8230-000c292ee274      ONLINE       0     0     0
    cache
      gptid/f0017430-e364-11e3-8230-000c292ee274      ONLINE       0     0     0

errors: 802342 data errors, use '-v' for a list

Besides the things you tried, I also updated FreeNAS, but with no results. In my case, however, I was forced to remove the old drive and physically replace it with a new one. I decided to go bold and detached the drive through the web GUI. This immediately changed the pool's status from 'DEGRADED' to 'ONLINE'.
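I assume the GUI detach just issues a plain zpool detach of the stale guid under the hood, i.e. roughly the following (guid and pool name taken from my status output above):

# zpool detach RAIDZ0_01 2345526077585836973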

– gradtje