We decided to replace our aging primary NAS, which consisted of three 48-drive SAS expanders full of 4TB drives, with a similar system of 12TB drives, reusing some of the newer hardware: one expander and a SAS card that had been added about a year earlier. The goal was to keep things as simple and as cheap as possible while not taking up any additional rack space.
The new hardware arrived (the server and two expanders) and was set up with Debian Buster and the ZFS packages available in the buster-backports repository. The pool was created with a mirror of two U.2 SSDs for the log, two more U.2 SSDs for the cache, 4 HDD hot spares (2 per expander), and 12 RAID-Z2 vdevs of 7 drives each (6 vdevs per expander). Everything looked good, and I started copying the data from the old NAS to this one using a script built around incremental snapshots, zfs send, and zfs receive.
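For context, the transfer script follows the usual incremental send/receive pattern. The sketch below is a minimal reconstruction, not the actual script, and the dataset and host names (oldpool/data, newnas, bigvol/data) are placeholders; it only prints the commands it would run, so it can be reviewed safely:

```shell
#!/bin/bash
# Sketch of one replication step: a full stream on the first run,
# an incremental stream (-i) on every run after that.
set -euo pipefail

replicate() {
    local prev="$1" new="$2" host="$3" dest="$4"
    if [ -n "$prev" ]; then
        # Incremental stream: only blocks changed since the previous snapshot.
        echo "zfs send -i ${prev} ${new} | ssh ${host} zfs receive -F ${dest}"
    else
        # First run: full stream of the initial snapshot.
        echo "zfs send ${new} | ssh ${host} zfs receive -F ${dest}"
    fi
}

replicate "" "oldpool/data@run1" "newnas" "bigvol/data"
replicate "oldpool/data@run1" "oldpool/data@run2" "newnas" "bigvol/data"
```

Each run therefore only moves the blocks written since the previous snapshot, which is why runs two and three were much shorter than the first.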
The first run of the script took many days but eventually finished with no problems on either end. The second run worked as well. After the third run, however, the pool was in trouble: in 4 of the vdevs a large number of disks had changed status to UNAVAIL or FAULTED, and all 4 spares had been pulled into use automatically. The output of zpool status follows.
# zpool status
  pool: bigvol
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan 28 09:55:20 2021
        160T scanned at 11.5G/s, 151T issued at 10.8G/s, 160T total
        4.99T resilvered, 94.53% done, 0 days 00:13:46 to go
config:

        NAME                            STATE     READ WRITE CKSUM
        bigvol                          DEGRADED     0     0     0
          raidz2-0                      ONLINE       0     0     0
            scsi-35000c500cacd481b      ONLINE       0     0     0
            scsi-35000c500cacceddb      ONLINE       0     0     0
            scsi-35000c500cacd5c4b      ONLINE       0     0     0
            scsi-35000c500cacd19cb      ONLINE       0     0     0
            scsi-35000c500cacd0f4f      ONLINE       0     0     0
            scsi-35000c500cacd5efb      ONLINE       0     0     0
            scsi-35000c500cacd133f      ONLINE       0     0     0
          raidz2-1                      ONLINE       0     0     0
            scsi-35000c500cab6617f      ONLINE       0     0     0
            scsi-35000c500cacd131b      ONLINE       0     0     0
            scsi-35000c500cacd1637      ONLINE       0     0     0
            scsi-35000c500cacd0dd3      ONLINE       0     0     0
            scsi-35000c500cab64247      ONLINE       0     0     0
            scsi-35000c500cacd5f4b      ONLINE       0     0     0
            scsi-35000c500cacd206b      ONLINE       0     0     0
          raidz2-2                      ONLINE       0     0     0
            scsi-35000c500cacd251f      ONLINE       0     0     0
            scsi-35000c500cacf60a7      ONLINE       0     0     0
            scsi-35000c500cacd55cb      ONLINE       0     0     0
            scsi-35000c500cacd3a5f      ONLINE       0     0     0
            scsi-35000c500cacd0fa7      ONLINE       0     0     0
            scsi-35000c500cacd4cb3      ONLINE       0     0     0
            scsi-35000c500cacd2edf      ONLINE       0     0     0
          raidz2-3                      DEGRADED     0     0     0
            scsi-35000c500cacd1627      ONLINE       0     0     0
            scsi-35000c500cacd049f      ONLINE       0     0     0
            scsi-35000c500cacdf9d3      ONLINE       0     0     0
            scsi-35000c500cab51563      DEGRADED     0     0     1  too many errors (resilvering)
            scsi-35000c500cacd1c9b      DEGRADED     0     0     0  too many errors
            scsi-35000c500cacdf757      FAULTED      0    10    48  too many errors (resilvering)
            scsi-35000c500cacd291b      FAULTED      0    11    31  too many errors (resilvering)
          raidz2-4                      DEGRADED     0     0     0
            spare-0                     DEGRADED     0     0    11
              scsi-35000c500cacdb54f    FAULTED      0    18     0  too many errors (resilvering)
              scsi-35000c500cacdc907    DEGRADED     0     0     0  too many errors (resilvering)
            scsi-35000c500cacd2c77      DEGRADED     0     0     4  too many errors
            scsi-35000c500cacdbdd3      DEGRADED     0     0    12  too many errors (resilvering)
            scsi-35000c500cacd0a47      DEGRADED     0     0     7  too many errors (resilvering)
            scsi-35000c500cacdf107      DEGRADED     0     0     4  too many errors (resilvering)
            scsi-35000c500cacd59fb      DEGRADED     0   195    79  too many errors (resilvering)
            scsi-35000c500cacd5307      DEGRADED     0   177    30  too many errors (resilvering)
          raidz2-5                      DEGRADED     0     0     0
            spare-0                     DEGRADED     0     0    15
              scsi-35000c500cacd03a3    FAULTED      0    12     0  too many errors (resilvering)
              scsi-35000c500cacd340b    ONLINE       0     0     0
            scsi-35000c500cacd29d7      FAULTED      0    21    24  too many errors (resilvering)
            scsi-35000c500cacd23d7      DEGRADED     0     0    11  too many errors (resilvering)
            scsi-35000c500cacd1c27      DEGRADED     0     0    29  too many errors (resilvering)
            spare-4                     DEGRADED     0     0    32
              scsi-35000c500cacd26bb    FAULTED      0    31     0  too many errors (resilvering)
              scsi-35000c500cacd299f    DEGRADED     0     0     0  too many errors (resilvering)
            scsi-35000c500cacd258b      DEGRADED     0   207    63  too many errors (resilvering)
            spare-6                     DEGRADED     0     0    24
              scsi-35000c500cacdf867    FAULTED      0    15     0  too many errors (resilvering)
              scsi-35000c500cacd60ef    ONLINE       0     0     0
          raidz2-6                      DEGRADED     0     0     0
            scsi-35000c500cacd2e37      ONLINE       0     0     0
            scsi-35000c500cacd0ecf      ONLINE       0     0     0
            11839096008852004814        UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-35000c500cacd1f8f-part1
            scsi-35000c500cacd088b      ONLINE       0     0     0
            scsi-35000c500cacd28df      ONLINE       0     0     0
            scsi-35000c500cacd068b      ONLINE       0     0     0
            scsi-35000c500cacdbd77      ONLINE       0     0     0
          raidz2-7                      ONLINE       0     0     0
            scsi-35000c500cacd040b      ONLINE       0     0     0
            scsi-35000c500cacd16bb      ONLINE       0     0     0
            scsi-35000c500cacd4d37      ONLINE       0     0     0
            scsi-35000c500cacd1b57      ONLINE       0     0     0
            scsi-35000c500cacd0453      ONLINE       0     0     0
            scsi-35000c500cacd3f6b      ONLINE       0     0     0
            scsi-35000c500cacd0297      ONLINE       0     0     0
          raidz2-8                      ONLINE       0     0     0
            scsi-35000c500cacd4bcb      ONLINE       0     0     0
            scsi-35000c500cacd36cf      ONLINE       0     0     0
            scsi-35000c500cacd1983      ONLINE       0     0     0
            scsi-35000c500cacd3aaf      ONLINE       0     0     0
            scsi-35000c500cacda90b      ONLINE       0     0     0
            scsi-35000c500cacd0d53      ONLINE       0     0     0
            scsi-35000c500cacdaa1f      ONLINE       0     0     0
          raidz2-9                      ONLINE       0     0     0
            scsi-35000c500cacd3f13      ONLINE       0     0     0
            scsi-35000c500cacd3187      ONLINE       0     0     0
            scsi-35000c500cacd59a3      ONLINE       0     0     0
            scsi-35000c500cacd0913      ONLINE       0     0     0
            scsi-35000c500cacdf663      ONLINE       0     0     0
            scsi-35000c500cacd156b      ONLINE       0     0     0
            scsi-35000c500cacd203f      ONLINE       0     0     0
          raidz2-10                     ONLINE       0     0     0
            scsi-35000c500cacd4c97      ONLINE       0     0     0
            scsi-35000c500cacd58a3      ONLINE       0     0     0
            scsi-35000c500cacd2353      ONLINE       0     0     0
            scsi-35000c500cacd3f67      ONLINE       0     0     0
            scsi-35000c500cacd235f      ONLINE       0     0     0
            scsi-35000c500cacdf14f      ONLINE       0     0     0
            scsi-35000c500cacd2583      ONLINE       0     0     0
          raidz2-11                     ONLINE       0     0     0
            scsi-35000c500cacd2f87      ONLINE       0     0     0
            scsi-35000c500cacdb557      ONLINE       0     0     0
            scsi-35000c500cacd00f3      ONLINE       0     0     0
            scsi-35000c500cacd3ea7      ONLINE       0     0     0
            scsi-35000c500cacd23ff      ONLINE       0     0     0
            scsi-35000c500cacd09d3      ONLINE       0     0     0
            scsi-35000c500cacd3adb      ONLINE       0     0     0
        logs
          mirror-12                     ONLINE       0     0     0
            nvme-eui.343842304db011100025384700000001  ONLINE       0     0     0
            nvme-eui.343842304db011060025384700000001  ONLINE       0     0     0
        cache
          nvme-eui.343842304db010920025384700000001  ONLINE       0     0     0
          nvme-eui.343842304db011080025384700000001  ONLINE       0     0     0
        spares
          scsi-35000c500cacdc907        INUSE     currently in use
          scsi-35000c500cacd299f        INUSE     currently in use
          scsi-35000c500cacd340b        INUSE     currently in use
          scsi-35000c500cacd60ef        INUSE     currently in use

errors: No known data errors
I have stopped the transfer, for obvious reasons, and am waiting for the resilver to finish before I replace the FAULTED and UNAVAIL drives. However, should the DEGRADED drives be replaced as well? And does anyone have an idea why this might happen, beyond the possibility of a simply bad batch of drives? Or perhaps I just need to destroy the pool and replace the drives. Either way, I expect the data will have to be copied yet again.
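For the clearly dead disks, my rough plan after the resilver completes is the standard replace/detach sequence sketched below; the faulted device ids are taken from the status output above, while the new-disk id is a placeholder for whatever the replacement shows up as in /dev/disk/by-id:

```shell
# Swap a FAULTED member for a fresh disk (new-disk id is a placeholder):
zpool replace bigvol scsi-35000c500cacdf757 scsi-<new-disk-id>

# Once a replacement has resilvered, detach the borrowed hot spare so it
# returns to the spare pool (id taken from the spares list above):
zpool detach bigvol scsi-35000c500cacd340b

# After the hardware side is sorted out, reset the error counters and
# verify everything end to end:
zpool clear bigvol
zpool scrub bigvol
```

If it turns out the DEGRADED disks are fine and only the cabling or expander was at fault, I assume clear plus a scrub would be enough for those, but that is exactly the part I am unsure about.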