
With ZFS, if you have copies=2 and then lose a drive containing some of those copies, how do you tell the system that it should make a new copy of the data blocks for the affected files? Or does ZFS just start adding data blocks for the extra copies as soon as it finds out about bad data blocks?

Will scrub do this?

(v0.6.0.56-rc8, ZFS pool version 28, ZFS filesystem version 5, Ubuntu 11.10)
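
For reference, the commands involved would look roughly like this (the pool name `tank` is just a placeholder):

```
zpool status -v tank   # check pool health and list any files with unrecoverable errors
zpool scrub tank       # start a scrub: re-read and verify every block, repairing where possible
zpool status tank      # shows scrub progress and how much was repaired
```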

James Moore

3 Answers


"copies=2" (or 3) is more designed to be used with pools with no redundancy (single disk or stripes). The goal is to be able to recover minor disk corruption, not a whole device failure. In the latter case, the pool is unmountable so no ditto blocks restoration can occur.

If you have redundancy (mirroring/raidz/raidz2/raidz3), the ditto blocks are no different from any other blocks, and scrubbing/resilvering will recreate them.
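
For example, replacing a failed disk in a mirrored pool is enough to get the ditto blocks back along with everything else; a rough sketch (device names are placeholders):

```
zpool replace tank c1t0d0 c1t2d0   # swap the faulty disk for a new one; this starts a resilver
zpool status tank                  # watch resilver progress
zpool scrub tank                   # optionally verify everything afterwards
```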

jlliagre
  • This directly conflicts with what @Redmumba says - and Redmumba provides links to code. Can you cite some sources for what you're saying? In particular, I'd love to see good citations for why you think copies=N isn't going to cope with whole device failure - that doesn't match with anything I've read. – James Moore Apr 11 '12 at 03:58
  • @James Moore After a whole device failure, no ditto blocks will be written on that disk. There is no redundancy at the pool level, so there is no way to replace the faulty disk with a new one. The only way to properly recover from that situation would be to do a full backup of the pool, recreate it with healthy devices, and restore from backup, while making sure no unintentional reboot occurs before the first backup is done. Otherwise the pool might not be importable and its data lost. This is quite a burden compared to redundant pools, where recovering from a bad disk is done on-line and survives reboots. – jlliagre Apr 11 '12 at 22:07
  • "There is no redundancy at the pool level so there is no way to replace the faulty disk by a new one." This doesn't seem relevant. Why do I care about redundancy at the pool level? copies=3 gives me redundancy at the file level. It sure seems like replacing the failing disk + scrub gets back to a good state. – James Moore Apr 11 '12 at 22:54
  • @James Moore: I understand your point, and you are correct that there is enough data redundancy for a disk failure not to affect stored data. However, my understanding is that you still have to care about redundancy at the pool level, because the pool itself doesn't care about redundancy at the dataset level. With a striped pool you can only replace healthy devices, i.e. those reported as ONLINE. A faulty device, i.e. one in the UNAVAIL state, is not replaceable. – jlliagre Apr 12 '12 at 00:50
  • "With a striped pool, you can only replace healthy devices" - reference, please? Nothing I see in the doc says that this is true. What exactly am I missing? And even if I am missing something that says I can't replace that drive - so what? I care about data. As long as there are multiple copies of the data, the right functionality is there. – James Moore Apr 12 '12 at 07:08
  • Here is a reference: http://docs.oracle.com/cd/E19082-01/817-2271/gbbvf/index.html#6mhupg6rl `For a device to be replaced, the pool must be in the ONLINE state. The device must be part of a redundant configuration, or it must be healthy (in the ONLINE state).` I assume copies=2 or 3 is not considered to be a redundant configuration. – jlliagre Apr 12 '12 at 12:17
  • OK, that's interesting. I assumed the opposite: obviously copies=N is a redundant configuration, but now I suspect that you're right and I'm wrong. Unfortunate that they don't also document that here: http://docs.oracle.com/cd/E19082-01/817-2271/gazgd/index.html – James Moore Apr 12 '12 at 14:57
  • copies=N is definitely not a redundant configuration as far as the pool is concerned. This is what the zpool command checks. Even at the dataset level, there is no guarantee for ditto blocks to exist at all. For example, copies=N might have been set after some blocks have been written. These blocks have no copies. – jlliagre Apr 12 '12 at 15:04
  • Sure, but that's OK - I'd expect that status would just say that some files are lost. (OK clearly meaning "we've got _something_ useful here", not "everything's fine"). Now I'm wondering why they even added the copies=N feature. – James Moore Apr 12 '12 at 15:16
  • One thing to keep in mind, though, is that if you originally had `copies=1` and you've upped it to `copies=2`, then you'll probably want to resilver/scrub afterwards, which will create these instances. But @jlliagre is correct: ditto blocks do not constitute a redundant configuration. There is NO guarantee that the blocks are set on another device, even if you have multiple devices in a pool. – Andrew M. Apr 12 '12 at 16:44
  • The "copies=N where N>1" feature is not intended to add redundancy. It is intended to resolve data corruption. Everything written to ZFS is checksummed or hashed. When it is read back, the checksum/hash is verified. If N=1, a checksum/hash verification failure results in an error back to the app. If N>1, one of the other copies can be consulted and used to repair all the other copies. – longneck Jul 24 '12 at 13:39

I found this question really intriguing, and after spending an hour poring over documentation, I dove into the code. Here's what I found.

First, some terminology. Ditto blocks (which is what these copies are, as opposed to mirrors) are created automatically on write, but they may or may not be placed on the same virtual device (vdev) as the original copy. Mirrored blocks, on the other hand, are always reflected onto another virtual device.

However, the code refers to both types of blocks as children. You'll see here that ditto blocks are just children with io_vd == NULL (this is in the write function). For a mirrored block, io_vd would be set to the corresponding virtual device (your second disk, for example).

With that in mind, when it gets to the read portion, it treats all children (whether mirror or ditto blocks) as potentially unsafe if the block doesn't have the expected number of good copies (good_copies), and rewrites them as needed. So the answer to your question is: yes, it will rewrite them when you have at least one good copy and any of the following applies (a rough command-line illustration follows the list):

  • Unexpected errors when you tried to read the data,
  • You are resilvering, or
  • You are scrubbing.
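
In practical terms, any of these should give ZFS the chance to rewrite the missing copies (pool, device, and file names below are placeholders):

```
zpool scrub tank                    # scrub: reads and verifies every block in the pool
zpool replace tank c1t0d0 c1t2d0    # replacing a device triggers a resilver
cat /tank/some/file > /dev/null     # a plain read that hits a bad copy also triggers a repair
```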

Phew! Maybe someone can point out flaws, but I enjoyed learning about ZFS through this little exercise, and I hope this helps!

Andrew M.
  • The problem is in @jlliagre's answer - the pool is dead if it loses any device. The fact that the pool still has enough ditto blocks doesn't seem to matter. Any way around that? – James Moore Apr 12 '12 at 16:09
  • @JamesMoore you can force the array online in a degraded state if you have the first 1MB of the device that failed. Presumably you just need the metadata from the failed device. I've tested this with a jbod-style zpool and it works: [recovering raidz broken labels](http://mail.opensolaris.org/pipermail/zfs-discuss/2012-June/051731.html). I did an md5sum before and after I broke the zpool, and only the copies=1 filesystem was broken after the import. The copies=2 and copies=3 filesystems matched up perfectly. – Jodie C Jun 27 '12 at 03:12

@jlliagre and others seem to think that the entire zpool dies if one of the disks (vdevs) dies while the pool is not redundant (mirror/raidz). This is not true; a multi-disk pool will always survive a single complete disk failure even if it is not a mirror or raidz.

ZFS metadata is always copied at least 2 times, so total failure of a complete disk (or any portion of it) will not take down the file system. Furthermore, many files, especially smaller ones, will not be spread across all disks and will therefore not necessarily be faulted by the disk failure. The OP is asking about the case of a multi-disk pool using ditto blocks (user data copies > 1). Here, a single complete disk failure should never result in any data loss. ZFS will always try to put ditto blocks far away from the original block, and for pools with multiple vdevs this always means on another vdev (an exception might be where one vdev is >50% of the pool, which would be very unusual). File system metadata is also always copied +1 or +2 times more than the ditto level, so it will always survive disk failure. Furthermore, if you have a pool of more than three disks, you should be able to lose up to half of them without any data loss; ZFS stores the ditto blocks on the next disk over, so as long as you never lose two adjacent disks you never have data loss (three adjacent disk failures for ditto=2).

When there are sufficient copies of data to access a file (whether those copies come from ditto blocks, mirroring, or raidz), any missing copies of the data are repaired when the file is accessed. This is the purpose of the scrub: read all data and fix anything that is bad by making use of the redundant copies. So to answer the OP's question directly, you just need to do a scrub after replacing the failed drive, and all copies will be restored.

As always, you can easily experiment with these concepts by creating pools whose backing-store vdevs are just ordinary sparse files. By deleting or corrupting the vdev files you can simulate any type of failure, and can verify the integrity of the pool, file systems, and data along the way.
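
A minimal sketch of such an experiment, assuming a Linux box with ZFS installed (file names, sizes, and the pool name are arbitrary):

```
# create three sparse 256 MB files to act as vdevs
truncate -s 256M /var/tmp/vdev1 /var/tmp/vdev2 /var/tmp/vdev3

# build a non-redundant (striped) pool on them and request two copies of user data
zpool create testpool /var/tmp/vdev1 /var/tmp/vdev2 /var/tmp/vdev3
zfs set copies=2 testpool

# write some test data
cp -r /etc /testpool/

# corrupt part of one backing file (well past the front labels), then scrub
dd if=/dev/urandom of=/var/tmp/vdev2 bs=1M seek=64 count=32 conv=notrunc
zpool scrub testpool
zpool status -v testpool   # should show checksum errors that were repaired from the ditto copies
```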

EDIT: after experimenting, it looks like ZFS will fail the pool if a disk fails in a multi-disk non-redundant pool with copies>=2. Partial data corruption on one or more disks should remain survivable and should be fixed by a scrub.

Aaron B
  • The scary thing about those sorts of experiments is that they're great for telling me a setup will fail immediately or at least quickly. They're not so great for telling me that a setup will fail occasionally. In any case, it's not clear how you bring back a pool that has a failure; I tried setting up a pool like this with three sparse files and removing one of the sparse files seems to be fatal to the entire pool. zpool replace won't replace the failed file, zpool scrub stalls at 5% (and these are very small pools), and the error page at http://illumos.org/msg/ZFS-8000-5E isn't optimistic. – James Moore Jul 28 '15 at 16:48
  • I had a similar result from my own experiments, done only after writing my answer. I normally only use raidz, and was answering based on information from what I believed to be credible sources (Oracle blogs). I no longer believe that a multi-disk JBOD-type pool with copies>1 can survive a disk failure. – Aaron B Jul 28 '15 at 18:59