Can I remove a disk from a ZFS stripe?

Question

I have a large pool on a system. The system is a storage node in a Hadoop cluster, so a stripe is fine because if we lose the local pool we can reconstruct the data at the cluster level.

A disk is going bad: can I tell ZFS to try and move the storage blocks off the device so I can remove it, or do I need to delete the entire pool and rebuild it? Ideally, I can pull the disk out, and later add a new disk when I swap out the failed hardware.

I assume the answer is "no" because a traditional RAID operates at a block level, but maybe a ZFS storage pool is smart enough to at least try to relocate the file data.

> sudo zpool list
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank  19.9T  8.09T  11.8T        -         -    15%    40%  1.00x  DEGRADED  -
> sudo zpool status
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0 days 02:45:39 with 0 errors on Sun Jan  9 03:09:41 2022
config:

        NAME                               STATE     READ WRITE CKSUM
        tank                               DEGRADED     0     0     0
          ata-ST2000DM001-1ER164_Z4Z0xxxx  ONLINE       0     0     0
          ata-ST2000DM001-1ER164_Z4Z0xxxx  DEGRADED    96     0     0  too many errors
          scsi-35000cca22dc7xxxx           ONLINE       0     0     0
          scsi-35000cca22dc7xxxx           ONLINE       0     0     0
          scsi-35000cca22dc8xxxx           ONLINE       0     0     0
          scsi-35000cca22dc8xxxx           ONLINE       0     0     0
          scsi-35000cca22dc7xxxx           ONLINE       0     0     0
          scsi-35000cca22dc7xxxx           ONLINE       0     0     0
          scsi-35000cca22dc7xxxx           ONLINE       0     0     0
          ata-ST2000DM001-1ER164_Z4Z3xxxx  ONLINE       0     0     0
          ata-ST2000NM0011_Z1P3xxxx        ONLINE       0     0     0

I predict the answer is that when I am ready to replace the failing disk, I will want to first destroy the pool, replace the disk, and build a new pool.

asciiphil · Answer 1 · 2022-01-15T00:16:02.297

First off, ZFS needs all top-level vdevs to be functional in order for the pool to operate. If one vdev goes offline, you will lose access to all of the data in the pool. You are using individual disks as vdevs, so if that disk fails outright (as opposed to its current state of "many read errors"), you will have to recreate the entire pool from scratch.

If you are on Solaris or if you are using OpenZFS 0.8 or later, you should be able to run:

zpool remove tank ata-ST2000DM001-1ER164_Z4Z0xxxx

This might not work! And if it does, it might permanently degrade the pool's performance.

vdev removal requires there to be enough space on the remaining disks for the displaced data. It looks like you probably have enough room in this case, but I'm mentioning the problem for completeness.
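
As a rough sanity check of that space requirement, something like the following could be used. This is only a sketch: `fits_after_removal` is a made-up helper, the byte counts are illustrative, and on a real system the per-vdev figures would have to be read off `zpool list -v -p tank`. ZFS also keeps some space in reserve, so treat a pass as necessary rather than sufficient.

```shell
# Hypothetical helper: succeeds if the data allocated on the vdev being
# removed is no larger than the free space on the remaining vdevs.
# Both arguments are byte counts.
fits_after_removal() {
    alloc_on_vdev=$1     # bytes allocated on the vdev to remove
    free_elsewhere=$2    # total bytes free on all other vdevs
    [ "$alloc_on_vdev" -le "$free_elsewhere" ]
}

# Illustrative numbers only: about 0.8 TB to move, 11 TB free elsewhere.
if fits_after_removal 800000000000 11000000000000; then
    echo "probably enough room"
fi
```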

On OpenZFS, at least, there are a number of restrictions on when you can remove vdevs. You can only remove a vdev if your pool consists solely of single-disk vdevs and/or mirrored vdevs. Your pool qualifies, because you're using single-disk vdevs exclusively. But if you had any raidz, draid, or special-allocation vdevs on OpenZFS, you wouldn't be able to do this.

A final caveat is that removing a vdev incurs a permanent performance penalty in OpenZFS. OpenZFS will record an internal table for all of the blocks that had previously been on the removed disk. For as long as those blocks exist in the pool from that point on, all access to them will require an indirect lookup through the remapping table. This can slow down random access significantly. I don't know enough about Solaris ZFS internals to be able to say whether it does anything similar.

And, of course, ZFS will need to read all of the data from the failing disk in order to remove it. It is entirely possible that it will encounter enough errors during that process that it will simply fail the disk. If that happens, as discussed earlier, the entire pool will go offline and will likely be unrecoverable.
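
If you do go the removal route, it could be driven roughly like this. Treat it as a sketch under assumptions, not a tested procedure: the function names are made up, `zpool` is assumed to be on the PATH and run with sufficient privileges, and the `-n` (dry run) and `-s` (cancel) options assume OpenZFS 0.8 or later.

```shell
# Dry run: prints the estimated memory the remapping table would use
# after the removal, without actually removing anything.
removal_estimate() {
    zpool remove -n "$1" "$2"
}

# Start the real removal. The evacuation runs in the background, and
# its progress is reported in the pool's status output.
removal_start() {
    zpool remove "$1" "$2" && zpool status "$1"
}

# Cancel a removal that is still in progress.
removal_cancel() {
    zpool remove -s "$1"
}

# Example (device name left as a placeholder):
#   removal_estimate tank <failing-device-id>
#   removal_start    tank <failing-device-id>
```

Running the estimate first gives you a chance to look at the cost of the remapping table before committing, and the cancel option is there if the failing disk starts throwing more errors partway through.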

If you have an available slot to add a disk, you might be better off putting in a spare disk and using zpool replace to substitute the new disk for the failing one. That will incur the same read load to copy the data off (and will bear the same risks of the single disk failing during the process), but if it succeeds you won't need to worry about the potential drawbacks that come with vdev removal.
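
The replacement route would look something like the sketch below. The wrapper name `replace_failing_disk` is made up, and both device names are placeholders you would take from `zpool status` and `/dev/disk/by-id`.

```shell
# Substitute a new disk for the failing one, then show the pool status,
# where the progress of the resulting resilver is reported.
replace_failing_disk() {
    pool=$1
    old=$2    # device currently in the pool
    new=$3    # newly installed device
    zpool replace "$pool" "$old" "$new" || return 1
    zpool status "$pool"
}

# Example (placeholders, not real device names):
#   replace_failing_disk tank <failing-device-id> <new-device-id>
```

The old disk stays part of the pool until the copy finishes, so don't pull it until `zpool status` shows the replacement as complete.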

In general, ZFS can be very brittle when used as you are—with non-redundant single-disk vdevs. There's an old joke that the zero in RAID0 is how much you must care about your data. A ZFS pool of single-disk vdevs is essentially the same as RAID0 from a data security standpoint. A failure of any single disk will likely cause you to lose all of your data. Even if you can afford to replace the data, make sure you're taking into account the time required to do that replacement. If you can afford a performance penalty traded off for data security, consider putting your future pools' disks into raidz2 vdevs. If you can afford to trade usable disk space for data security (and possibly increased read performance), consider putting your future pools' disks into mirror vdevs.
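
For a future pool, the two layouts mentioned above could be created along these lines. This is a sketch only: the helper names are made up, the disk names are whatever you pass in, and `zpool create` will happily destroy existing data on those disks, so check the list twice.

```shell
# One raidz2 vdev: the vdev survives any two disk failures.
create_raidz2_pool() {
    pool=$1; shift
    zpool create "$pool" raidz2 "$@"
}

# Two-way mirrors: disks are paired in the order given.
# A leftover odd disk is ignored here.
create_mirror_pool() {
    pool=$1; shift
    vdevs=""
    while [ "$#" -ge 2 ]; do
        vdevs="$vdevs mirror $1 $2"
        shift 2
    done
    # Intentionally unquoted so the list splits into separate arguments;
    # this assumes device names contain no whitespace.
    zpool create "$pool" $vdevs
}

# Example (placeholder disk names):
#   create_mirror_pool tank <disk1> <disk2> <disk3> <disk4>
```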

Nice answer. I've noticed that experienced ZFS admins tend to use mirrors in surprising ways, and tend to limit vdevs to 5 devices or so... For a high performing low redundancy (~1.5 disk parity) backup pool such as this with 11 devices, is a raidz2 with 11 devices slower than say a stripe of two raidz1 vdevs? eg the four ata Seagates as raidz1, striped together with a six disk raidz1 of the scsi disks? — Tomachi, Jul 15 '22 at 17:18
@Tomachi A single vdev will almost always be less performant than if you split the same disks across two vdevs. ZFS writes all data to a vdev in a single go, so the write has to wait for the slowest disk to finish before it's done. Unless your storage topology has a different natural division, I would try not to put more than ten disks in a single raidz vdev. I wouldn't use raidz1 for modern disks. If I had 11 disks, I'd prefer, in order: (1) five mirror vdevs with a spare, (2) two five-disk raidz2 vdevs with a spare, (3) one eleven-disk raidz2 vdev. — asciiphil, Jul 18 '22 at 15:34
i think i finally get it! all top level vdevs are striped by default? that part of zfs is not configurable yeah? this finally explains the weird use of mirrors i see! it says the stripe is a safe topology which it isn't and explains all my confusion. :) — Tomachi, Aug 12 '22 at 12:20
You can think of top level vdevs as being striped. Technically, ZFS *distributes* writes across the vdevs in proportion to the amount of free space on each one. That's pretty fundamental to the design of ZFS and can't be changed, as far as I know. But you can implement all sorts of topologies by your choice of redundancy structure within vdevs. — asciiphil, Oct 21 '22 at 14:03
