
I'm trying to understand why there is a large discrepancy between the used space of a production ZFS dataset and that of a backup dataset that is populated by a nightly zfs send (I keep 30 daily snapshots and replicate nightly; no other systems write to or otherwise access the backup dataset). Compression and deduplication are not enabled on either side. The backup dataset reports 315T used while production is only using 311T (the two systems are essentially mirrored in terms of hardware). My issue is that the nightly zfs sends are now failing (out of space).

A follow-up question: is there an obvious way out of this issue? The backup pool shows 10.7T free, but that doesn't seem to be available to the dataset, which only reports 567G free. If I were to destroy the backup pool and perform a full zfs send of the production data, would we expect it to complete? I've already destroyed all but the most recent two snapshots on the backup dataset, but that didn't free enough space to allow a new zfs send. I purposely set a quota of 312T on the production dataset to help keep users in check, as they'll often work near 100% full, but it seems that quota may not have been enough? (There is no quota defined on the backup pool/dataset.)

Production system:
# zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
data      326T   311T  15.3T         -    44%    95%  1.00x  ONLINE  -

# zfs list
NAME                                                 USED  AVAIL  REFER  MOUNTPOINT
data                                                 311T  5.11T    96K  /data
data/lab                                             311T  1.30T   306T  /data/lab


Backup system:
# zpool list
NAME        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
backup      326T   315T  10.7T         -     6%    96%  1.00x  ONLINE  -

# zfs list
NAME                                                   USED  AVAIL  REFER  MOUNTPOINT
backup                                                 315T   567G    96K  /backup
backup/lab                                             315T   567G   315T  /backup/lab
user
  • Which OSes are those? – poige Sep 30 '19 at 16:39
  • Both are CentOS 7.4 and ZFS 0.7.2-1 – user Sep 30 '19 at 16:44
  • You'd better have compression turned on; `lz4` is a good default choice: very low overhead but a big payoff. **Don't run `dedup` though**. – poige Sep 30 '19 at 16:44
  • `sudo zdb -C …poolName…` — that info could be useful for analysis. If you don't mind sharing it, put it somewhere like a paste service or, say, a GitHub gist. – poige Sep 30 '19 at 16:49
  • The zdb output for the two zpools can be found here: https://pastebin.com/2BGAUxZs I noticed the backup pool's output contained 'space map refcount mismatch: expected 660 != actual 652' and I'm now looking to see if this is related to the discrepancy. – user Sep 30 '19 at 17:22
  • I have a dim memory about the free space shown in `zpool list` being wrong and the free space shown in `zfs list` being correct, but I don't remember why that is or where I read it right now. I'll see if I can find it. – Michael Hampton Sep 30 '19 at 18:59
  • @MichaelHampton rather than being "wrong", the free space reported by `zpool list` is misleading: for example, it does not account for RAIDZ parity overhead. This is because `zpool` accounts for *pool-level* free space, which does not directly translate to *filesystem* free space. To check how much "userland" free space really exists, one should use `zfs list`. – shodanshok Sep 30 '19 at 19:47
  • Are you using reservations? Why do `data` and `data/lab` have different amounts of free space? – Jim L. Sep 30 '19 at 22:48
  • @JimL. Yes, there is a quota of 312T set on data/lab, which I believe explains the difference in free space. There are also a number of snapshots to explain the difference between the used and referenced amounts. There is no such quota on the backup system, and at the time I pulled those numbers I had purged almost all snapshots on the backup system. – user Sep 30 '19 at 23:35
  • @user can you show the output of `zfs list -o space` on both primary and backup pool? – shodanshok Oct 01 '19 at 05:59
  • @shodanshok see output here: https://pastebin.com/e5cN2NDe Keep in mind, I've since set spa_slop_shift to 6 and performed a zfs send to transfer the latest snapshots, so capacities have moved since my first post. That said, I'm still unclear why the production dataset is using 305T while the backup is 317T (which should be a mirror image as of last midnight when the last zfs send was performed, right?) – user Oct 01 '19 at 15:50

1 Answer


Based on the output of zdb, your destination (backup) pool does not have a SLOG device. This means that additional ZIL blocks are allocated inside the pool's regular space, eating into available space. While the main pool's SLOG device is "only" ~185 GB (which is already far more than a regular SLOG needs), the impact on the destination pool can be considerably bigger, as new ZIL blocks are continuously allocated from the pool itself. Moreover, this can lead to over-fragmented metaslabs, with more total allocated space.
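
For reference, and assuming standard OpenZFS tooling, you can confirm whether a pool has a dedicated log vdev by looking for a `logs` section in the vdev tree shown by `zpool status`; the device path below is purely illustrative:

# zpool status backup
(no `logs` section in the output means the pool has no SLOG)
# zpool add backup log /dev/disk/by-id/nvme-EXAMPLE-part1
(hypothetical device path, shown only to illustrate how a SLOG would be attached)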

EDIT: Another possible cause of the size discrepancy can be zfs send/recv itself: by default it runs with maximum-compatibility options, which can somewhat inflate the receiving side. For more information, see the mailing list post here
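
As a rough sketch (the snapshot name and destination host below are placeholders, and `-L`/`-e` require the corresponding pool features to be enabled), a stream generated with the large-block and embedded-data flags preserves the source block layout more closely than the default compatibility stream:

# zfs send -L -e data/lab@20191001 | ssh backuphost zfs recv -F backup/lab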

NOTE: the above answer makes the explicit assumption that all other things are, and were, equal: for example, if you temporarily enabled compression on the source pool, space allocation would clearly differ from the destination pool (if the latter never had compression enabled).
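
If in doubt, the relevant properties can be compared on both sides (these are standard ZFS properties; run the first command on the production host and the second on the backup host):

# zfs get compression,compressratio,recordsize,copies data/lab
# zfs get compression,compressratio,recordsize,copies backup/lab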

FIX: I don't think that adding a SLOG to the backup pool would be useful now; rather, I would increase `spa_slop_shift` to recover some significant space. The default value is 5; try setting it to 6 by issuing `echo 6 > /sys/module/zfs/parameters/spa_slop_shift`. A subsequent `zfs list` should report significantly more available space (with no change in allocated space as shown by `zpool list`).
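
A minimal sketch of applying this at runtime and making it persist across reboots (the `/etc/modprobe.d/zfs.conf` filename is just the usual convention for ZFS-on-Linux module options):

# echo 6 > /sys/module/zfs/parameters/spa_slop_shift
# echo "options zfs spa_slop_shift=6" >> /etc/modprobe.d/zfs.conf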

If you need even more space, you can increase `spa_slop_shift` again, but be sure to read the documentation to understand what you are doing.

shodanshok
  • Thanks! Your suggested fix looks good. I'm still uncertain of the root cause, and I'm unable to find much information on this topic, i.e. (1) how/why ZIL data is packaged and transmitted with the zfs send snapshot stream (the backup dataset referenced nearly 9T more than the production dataset), and (2) the negative effects of an overly large SLOG (I realize much of the SLOG capacity will go unused, but I'm curious whether there would be a benefit to creating a smaller partition on the device). – user Sep 30 '19 at 23:12
  • @user I updated the answer with information taken from the mailing list. Give it a look. – shodanshok Oct 03 '19 at 08:47