
With all ZFS-on-Linux versions I've ever tried, using zfs list to list all snapshots of a filesystem or volume (zfs list -r -t snapshot -H -o name pool/filesystem) always takes many orders of magnitude more time to run than ls .zfs/snapshot, which is immediate:

$ time ls -1 /srv/vz/subvol-300-disk-1/.zfs/snapshot
[list of 1797 snapshots here]
real    0m0.023s
user    0m0.008s
sys     0m0.014s

# time zfs list -r -t snapshot -H -o name vz/subvol-300-disk-1
[same list of 1797 snapshots]
real    1m23.092s
user    0m0.110s
sys     0m0.758s

Is this bug specific to ZFS-on-Linux?

Can anybody with a Solaris or FreeBSD ZFS box perform a similar test (on a filesystem with hundreds of snapshots on spinning hard disks)?

Is there a workaround to get a quick list of snapshots for a volume, which by its nature does not have a .zfs directory?

I've run the above test with ZFS-on-Linux 0.6.5.2-2-wheezy on kernel 2.6.32-43-pve x86_64 (Proxmox), but I've always seen this issue, on both older and newer ZFS and kernel versions.


Here are the pool stats:

# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
vz    25.2T  9.42T  15.8T         -     5%    37%  1.00x  ONLINE  -

It contains 114 filesystems and 1 volume, each with hundreds of snapshots, as this is a zfs send / zfs recv backup server.


Solution: zfs list is slow because it fetches additional information about each snapshot, even when it's not displayed. The fix is to add both -o name and -s name, that is, to use zfs list -t snapshot -o name -s name.
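For example, applying the workaround to the filesystem from my test above (the same command as before, with -s name added):

# zfs list -r -t snapshot -H -o name -s name vz/subvol-300-disk-1

The same command works for volumes, which by their nature have no .zfs directory to ls.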

Tobia

2 Answers


zfs list -t snapshot always takes many orders of magnitude more time to run than ls .zfs/snapshot

You're also comparing two completely different operations.

zfs list -t snapshot enumerates all the ZFS snapshots on the system - and provides a lot of information about those snapshots, such as the amount of space used. Run that under strace to see the system calls made.
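For instance, a minimal comparison using the commands from the question (syscall counts and timings will vary by system):

# strace -c zfs list -r -t snapshot -H -o name vz/subvol-300-disk-1
# strace -c ls -1 /srv/vz/subvol-300-disk-1/.zfs/snapshot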

ls .zfs/snapshot just emits a simple name list from a directory. There's nothing to do other than read the names; it provides nothing else.

Andrew Henle
  • If you look at my actual example, I'm running `zfs list -r -t snapshot -H -o name pool/filesystem`, which outputs the same data as `ls .zfs/snapshot` – Tobia Aug 23 '16 at 10:44
  • @Tobia `zfs list` would still most likely collect all the data, and then just filter the output as you requested. It could be rewritten to be smarter and not collect expensive data when it isn't needed for the output, but optimizing for such an edge case is usually not a constructive use of time. Feel free to implement it, though! – Matija Nalis Aug 23 '16 at 10:49
  • A filesystem `ls` is a totally different operation than `zfs list -t snapshot`. Just because the output is the same doesn't mean that the same thing is happening behind the scenes. – ewwhite Aug 23 '16 at 10:49

Snapshot operations are a function of the number of snapshots you have, RAM, disk performance and drive space. This would be a general ZFS issue, not something unique to the Linux variant.

The better question is: why do you have 1797 snapshots of a zvol? That is definitely more than recommended and makes me wonder what else is happening on the system.

People say "ZFS snapshots are free", but that's not always true.

While ZFS snaps don't have an impact on production performance, the high number you have clearly requires disk accesses to enumerate.

Disk access time > RAM access time, hence the order of magnitude difference.


Here's the strace output for both commands. Note the time per syscall and imagine how poorly it would scale with the number of snapshots in your filesystem.

# strace -c ls /ppro/.zfs/snapshot

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0        10           read
  0.00    0.000000           0        17           write
  0.00    0.000000           0        12           open
  0.00    0.000000           0        14           close
  0.00    0.000000           0         1           stat
  0.00    0.000000           0        12           fstat
  0.00    0.000000           0        28           mmap
  0.00    0.000000           0        16           mprotect
  0.00    0.000000           0         3           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         2           ioctl
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         2           getdents
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         2         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   133         2 total

versus

# strace -c zfs list -t snapshot

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.003637          60        61         7 ioctl
  0.00    0.000000           0        12           read
  0.00    0.000000           0        50           write
  0.00    0.000000           0        19           open
  0.00    0.000000           0        19           close
  0.00    0.000000           0        15           fstat
  0.00    0.000000           0        37           mmap
  0.00    0.000000           0        19           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         4           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         3         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         2         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00    0.003637                   250         9 total
ewwhite
    The pool has 100+ filesystems and volumes, each with 1000+ snapshots, because this is used as a backup server. There is no IO except for adding new snapshots (with `zfs send -I ... | ssh ... 'zfs receive -F ...'`) and dropping old ones. In any case, why does `ls .zfs/snapshot` list the same exact data in an instant? It's enough for my purposes, but I can't find an equivalent for volumes. – Tobia Aug 23 '16 at 10:51
  • You shouldn't have that many snapshots for one. It's best to keep them well below 1,000. The `ls` is faster and unaffected by the number of snapshots because it's a simple directory listing. – ewwhite Aug 23 '16 at 10:55
  • I wouldn't call it a "simple directory listing," because it's a virtual directory that calls specific kernel code, much like `/proc`. Do you have any official source about keeping the snapshots below 1000? This is a backup server, so the high number of filesystems and snapshots is the entire purpose of this server. – Tobia Aug 23 '16 at 10:58
  • @Tobia An official source isn't necessary. This is common sense since you can _see_ that the time for the listing is a function of the number of snapshots. Again, this doesn't impact production, only snapshot operations. – ewwhite Aug 23 '16 at 11:01
  • I disagree. In any case, thanks for the `strace -c` output, I didn't know about it. My `zfs list` is performing 1 ioctl per snapshot, which is clearly collecting more data than just the name, as @MatijaNalis suggested. I'll see if I can come up with my own utility that only collects the names, if the kernel module API supports it. – Tobia Aug 23 '16 at 11:09
  • @Tobia This has already been discussed on [the ZoL github](https://github.com/zfsonlinux/zfs/issues/450#issuecomment-8399888). There are workarounds proposed there. – ewwhite Aug 23 '16 at 11:13
  • That page contains the following workaround: `zfs list -t snapshot -o name -s name`, which works. I was missing the `-s name`. Thanks! – Tobia Aug 23 '16 at 11:20
  • I also disagree that "[1797 snapshots] is definitely more than recommended"; [ZFS devs even say that thousands are not an issue](http://web.archive.org/web/20120211211239/http://mail.opensolaris.org/pipermail/zfs-discuss/2009-March/027191.html). What OP is seeing here is just O(N) time to enumerate the data on 1797 items. (I have over 40k snapshots on a production machine and performance is fine.) – Josh Jul 28 '18 at 15:54