
The pool consists of two HDDs (a WD Red 3 TB, 5400 RPM, max transfer rate 147 MB/s, and a Verbatim (Toshiba) 3 TB, 7200 RPM) in a raidz1-0 configuration. It holds 2.25 TB of data, duplicated to both disks, so the total amount is 4.5 TB. When I created the pool I did not specify an ashift value.

zpool status shows "scan: scrub repaired 0 in 32h43m with 0 errors on Sun Jan 3 13:58:54 2021". This means the scan speed was only 4.5e6 MB / (32.717 × 3600 s) ≈ 38.2 MB/s. I'd expect at least 2 × 100 MB/s, or up to 2 × 200 MB/s, although the WD disk is somewhat slower than the other.

SMART data for the disks shows that everything is healthy. They have 6.5–7 years of power-on time, but the start-stop count is only about 200.

So the main question: What might explain the poor read performance?

Oddly, zdb showed that the pool uses the path /dev/disk/by-id/ata-WDC_WD30EFRX-xyz-part1 rather than /dev/disk/by-id/ata-WDC_WD30EFRX-xyz. fdisk -l /dev/disk/by-id/ata-WDC_WD30EFRX-xyz warns that "Partition 1 does not start on physical sector boundary", but I have read that this should only hurt write performance. I might try fixing it by removing the device and adding it back with the proper full-disk path, since the data is duplicated (and backed up).
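
For reference, the checks were roughly these (the pool name tank is a placeholder):

zdb -C tank | grep path                            # which device paths the pool was set up with
fdisk -l /dev/disk/by-id/ata-WDC_WD30EFRX-xyz      # prints the alignment warning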

The pool has 7.1 million files. As a test I ran sha1sum on a 14,276 MB file after clearing the caches via /proc/sys/vm/drop_caches; it took 2 min 41 s, which puts the read speed at 88.5 MB/s.
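
Roughly, the test looked like this (the file path is a placeholder):

sync
echo 3 > /proc/sys/vm/drop_caches        # drop the page cache, dentries and inodes
time sha1sum /tank/path/to/large-file    # 2 min 41 s for 14276 MB => ~88.5 MB/s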

`dd bs=1M count=4096 if=/dev/disk/by-id/ata-WDC_WD30EFRX-xyz of=/dev/null` reported a speed of 144 MB/s, the same command on ata-WDC_WD30EFRX-xyz-part1 reported 134 MB/s, and on ata-TOSHIBA_DT01ACA300_xyz it reported 195 MB/s.
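
In other words, the sequential-read checks were along these lines (device names abbreviated as above):

dd bs=1M count=4096 if=/dev/disk/by-id/ata-WDC_WD30EFRX-xyz       of=/dev/null    # ~144 MB/s
dd bs=1M count=4096 if=/dev/disk/by-id/ata-WDC_WD30EFRX-xyz-part1 of=/dev/null    # ~134 MB/s
dd bs=1M count=4096 if=/dev/disk/by-id/ata-TOSHIBA_DT01ACA300_xyz of=/dev/null    # ~195 MB/s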

My NAS runs quite old software versions:

$ modinfo zfs
filename:       /lib/modules/3.11.0-26-generic/updates/dkms/zfs.ko
version:        0.6.5.4-1~precise
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     5FC0B558D497732F17F4202
depends:        spl,znvpair,zcommon,zunicode,zavl
vermagic:       3.11.0-26-generic SMP mod_unload modversions 

It has 24 GB of RAM, 8 GB of which is reserved for a JVM, but the rest is free to be used. Although not that much of it seems to be free:

$ free -m
             total       used       free     shared    buffers     cached
Mem:         23799      21817       1982          0        273       1159
-/+ buffers/cache:      20384       3415
Swap:         7874         57       7817
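
(Presumably a chunk of that "used" memory is the ZFS ARC, which free does not report separately; on ZFS on Linux it can be checked from the arcstats kstat:)

grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats    # current ARC size and its configured maximum, in bytes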

Edit 1:

I did some tests with bonnie++, using a single 4 GB file on the RAIDZ: write 75.9 MB/s, rewrite 42.2 MB/s and read 199.0 MB/s. I assume I did the conversion correctly from the "kilo-characters / second" figures.
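
The invocation was roughly along these lines (the target directory is a placeholder; -s is the test-file size in MiB, -n 0 skips the small-file tests, and -r claims a small RAM size so bonnie++ accepts a deliberately small 4 GB file on a 24 GB machine):

bonnie++ -d /tank/benchmark -s 4096 -n 0 -r 2048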

Ah, just now I realized that the parallel scrub takes as long as the slowest (5400 RPM) disk; it doesn't matter that the 7200 RPM one was (possibly) scrubbed faster.

Edit 2:

I reduced the number of files in the pool from 7.1 million to 4.5 million (-36.6%) and the scrub time dropped from 32.72 hours to 16.40 hours (-49.9%). The amount of data is the same, since I just put those small files into a low-compression ZIP archive.
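
The archiving can be done with something like this (paths are placeholders; -1 picks the fastest, lowest compression level), with the originals removed afterwards to bring the file count down:

zip -1 -r /tank/archive/smallfiles.zip /tank/data/smallfiles/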

I also increased the recordsize from 128k to 512k, though I have no idea whether that made a difference in this case. Pre-existing data was not touched, so it retains the original recordsize. Oh, and /sys/module/zfs/parameters/zfs_scan_idle was set to 2.
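
The corresponding commands were roughly (the dataset name is a placeholder; the new recordsize only applies to files written after the change):

zfs set recordsize=512K tank/data
echo 2 > /sys/module/zfs/parameters/zfs_scan_idle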

NikoNyrh
  • `dd bs=1M count=4096 if=/dev/disk/by-id/ata-WDC_WD30EFRX-xyz of=/dev/null` is not really a good way to demonstrate disk performance under normal usage. All that `dd` command does is stream data - there's effectively no time lost to seeking. Reading a random sector means the heads have to move to read the track, and then the disk has to wait for the sector to rotate under the heads. 5K rpm drives are **S-L-O-W** at that. 40 MB/sec for a scrub on such disks isn't poor at all. – Andrew Henle Jan 18 '21 at 14:14
  • I noticed that at the start of a scrub I may have a read-IO-bound txg_sync process, and this "phase" of the scrub is slow. Then, once that becomes unblocked, larger `zfs_top_maxinflight` values seem to help with getting higher `zpool iostat -v` read numbers. `iostat -x` shows the same util% no matter how fast or slow the scrub is going. – ThorSummoner Oct 18 '21 at 19:39

1 Answer


What version of ZFS are you running?

Pre-0.8.x ZFS scrubs by traversing all metadata and data as they are laid out on disk. This causes many seeks, which kill performance on mechanical disks. With low-performance 5K RPM disks filled with millions of small files, this means very long scrub/resilver times. On these older ZFS versions you can adjust some tunables, for example:

echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 0 > /sys/module/zfs/parameters/zfs_scan_idle

Be aware that increasing scrub priority will lead to slower application performance.

0.8.x uses a batched scrub approach, where metadata are collected in larger batches and only then is the relevant data scanned. This results in much faster scrubs (i.e., about half the time), without the need to tune anything (the above knobs are not even present anymore).

So the most effective way to increase scrub/resilver speed is probably to update your ZFS version.

shodanshok
  • I was aware that 0.8 brought performance improvements, but thanks for going into the details. This system is running 0.6.5. Before upgrading in the long term, I'll try tuning `zfs_scan_idle` and archiving small files into ZIPs, since I don't need active access to them. – NikoNyrh Jan 18 '21 at 23:42
  • @NikoNyrh Note that Ubuntu 12.04 is **far** past end of life. You should update the entire system ASAP. – Michael Hampton Jan 19 '21 at 08:15