We decided to replace our aging primary NAS, which consists of three 48-drive SAS expanders filled with 4TB drives, with a similar system built on 12TB drives. Some of the newer hardware is being reused: one expander and a SAS card that were added about a year ago. The goal was to keep things as simple and as cheap as possible without taking up any additional rack space in the end.

The new hardware (the server and two expanders) arrived and was set up with Debian Buster and the ZFS packages from the buster-backports repository. The ZFS pool was created with a mirror of two U.2 SSDs for the log, two more U.2 SSDs for the cache, 4 HDD spares (2 per expander), and 12 RAID-Z2 vdevs of 7 drives each (6 vdevs per expander). Everything looked good, so I started copying the data from the old NAS to the new one with a script that uses incremental snapshots, zfs send, and zfs receive.
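For readers who want to picture the layout, here is a minimal Python sketch that assembles a `zpool create` argument list matching the description above. It is not the command that was actually run: the device names are hypothetical placeholders (a real pool would use the `/dev/disk/by-id` names), the helper `build_zpool_create` is made up for illustration, and pool options such as `ashift` are left out.

```python
def build_zpool_create(pool, hdds, spares, log_devs, cache_devs, width=7):
    """Return an argv list for `zpool create` with RAID-Z2 vdevs of `width` disks."""
    if len(hdds) % width:
        raise ValueError("HDD count must be a multiple of the vdev width")
    argv = ["zpool", "create", pool]
    # One "raidz2 d1 ... d7" group per vdev.
    for i in range(0, len(hdds), width):
        argv += ["raidz2", *hdds[i:i + width]]
    argv += ["log", "mirror", *log_devs]   # mirrored log devices
    argv += ["cache", *cache_devs]         # cache devices are never mirrored
    argv += ["spare", *spares]             # hot spares shared by the pool
    return argv


# Placeholder names only: 12 vdevs x 7 disks = 84 data disks, plus 4 spares.
hdds = [f"hdd{n:02d}" for n in range(84)]
spares = [f"spare{n}" for n in range(4)]
cmd = build_zpool_create("bigvol", hdds, spares,
                         ["nvme0", "nvme1"], ["nvme2", "nvme3"])
print(" ".join(cmd))
```

The numbers also give a quick sanity check on the hardware: 84 data disks plus 4 spares is 88 HDDs, or 44 per expander (6 vdevs of 7 plus 2 spares).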

The first run of the script took many days but eventually finished, with no problems on either end. The second run worked as well. After the third run, however, the ZFS pool showed many problems. In 4 of the RAID-Z2 vdevs a large number of disks had changed status to DEGRADED, FAULTED, or UNAVAIL, and all 4 spares had been put into use automatically. The output of zpool status follows.

# zpool status
  pool: bigvol
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Jan 28 09:55:20 2021
    160T scanned at 11.5G/s, 151T issued at 10.8G/s, 160T total
    4.99T resilvered, 94.53% done, 0 days 00:13:46 to go
config:

    NAME                                           STATE     READ WRITE CKSUM
    bigvol                                         DEGRADED     0     0     0
      raidz2-0                                     ONLINE       0     0     0
        scsi-35000c500cacd481b                     ONLINE       0     0     0
        scsi-35000c500cacceddb                     ONLINE       0     0     0
        scsi-35000c500cacd5c4b                     ONLINE       0     0     0
        scsi-35000c500cacd19cb                     ONLINE       0     0     0
        scsi-35000c500cacd0f4f                     ONLINE       0     0     0
        scsi-35000c500cacd5efb                     ONLINE       0     0     0
        scsi-35000c500cacd133f                     ONLINE       0     0     0
      raidz2-1                                     ONLINE       0     0     0
        scsi-35000c500cab6617f                     ONLINE       0     0     0
        scsi-35000c500cacd131b                     ONLINE       0     0     0
        scsi-35000c500cacd1637                     ONLINE       0     0     0
        scsi-35000c500cacd0dd3                     ONLINE       0     0     0
        scsi-35000c500cab64247                     ONLINE       0     0     0
        scsi-35000c500cacd5f4b                     ONLINE       0     0     0
        scsi-35000c500cacd206b                     ONLINE       0     0     0
      raidz2-2                                     ONLINE       0     0     0
        scsi-35000c500cacd251f                     ONLINE       0     0     0
        scsi-35000c500cacf60a7                     ONLINE       0     0     0
        scsi-35000c500cacd55cb                     ONLINE       0     0     0
        scsi-35000c500cacd3a5f                     ONLINE       0     0     0
        scsi-35000c500cacd0fa7                     ONLINE       0     0     0
        scsi-35000c500cacd4cb3                     ONLINE       0     0     0
        scsi-35000c500cacd2edf                     ONLINE       0     0     0
      raidz2-3                                     DEGRADED     0     0     0
        scsi-35000c500cacd1627                     ONLINE       0     0     0
        scsi-35000c500cacd049f                     ONLINE       0     0     0
        scsi-35000c500cacdf9d3                     ONLINE       0     0     0
        scsi-35000c500cab51563                     DEGRADED     0     0     1  too many errors  (resilvering)
        scsi-35000c500cacd1c9b                     DEGRADED     0     0     0  too many errors
        scsi-35000c500cacdf757                     FAULTED      0    10    48  too many errors  (resilvering)
        scsi-35000c500cacd291b                     FAULTED      0    11    31  too many errors  (resilvering)
      raidz2-4                                     DEGRADED     0     0     0
        spare-0                                    DEGRADED     0     0    11
          scsi-35000c500cacdb54f                   FAULTED      0    18     0  too many errors  (resilvering)
          scsi-35000c500cacdc907                   DEGRADED     0     0     0  too many errors  (resilvering)
        scsi-35000c500cacd2c77                     DEGRADED     0     0     4  too many errors
        scsi-35000c500cacdbdd3                     DEGRADED     0     0    12  too many errors  (resilvering)
        scsi-35000c500cacd0a47                     DEGRADED     0     0     7  too many errors  (resilvering)
        scsi-35000c500cacdf107                     DEGRADED     0     0     4  too many errors  (resilvering)
        scsi-35000c500cacd59fb                     DEGRADED     0   195    79  too many errors  (resilvering)
        scsi-35000c500cacd5307                     DEGRADED     0   177    30  too many errors  (resilvering)
      raidz2-5                                     DEGRADED     0     0     0
        spare-0                                    DEGRADED     0     0    15
          scsi-35000c500cacd03a3                   FAULTED      0    12     0  too many errors  (resilvering)
          scsi-35000c500cacd340b                   ONLINE       0     0     0
        scsi-35000c500cacd29d7                     FAULTED      0    21    24  too many errors  (resilvering)
        scsi-35000c500cacd23d7                     DEGRADED     0     0    11  too many errors  (resilvering)
        scsi-35000c500cacd1c27                     DEGRADED     0     0    29  too many errors  (resilvering)
        spare-4                                    DEGRADED     0     0    32
          scsi-35000c500cacd26bb                   FAULTED      0    31     0  too many errors  (resilvering)
          scsi-35000c500cacd299f                   DEGRADED     0     0     0  too many errors  (resilvering)
        scsi-35000c500cacd258b                     DEGRADED     0   207    63  too many errors  (resilvering)
        spare-6                                    DEGRADED     0     0    24
          scsi-35000c500cacdf867                   FAULTED      0    15     0  too many errors  (resilvering)
          scsi-35000c500cacd60ef                   ONLINE       0     0     0
      raidz2-6                                     DEGRADED     0     0     0
        scsi-35000c500cacd2e37                     ONLINE       0     0     0
        scsi-35000c500cacd0ecf                     ONLINE       0     0     0
        11839096008852004814                       UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-35000c500cacd1f8f-part1
        scsi-35000c500cacd088b                     ONLINE       0     0     0
        scsi-35000c500cacd28df                     ONLINE       0     0     0
        scsi-35000c500cacd068b                     ONLINE       0     0     0
        scsi-35000c500cacdbd77                     ONLINE       0     0     0
      raidz2-7                                     ONLINE       0     0     0
        scsi-35000c500cacd040b                     ONLINE       0     0     0
        scsi-35000c500cacd16bb                     ONLINE       0     0     0
        scsi-35000c500cacd4d37                     ONLINE       0     0     0
        scsi-35000c500cacd1b57                     ONLINE       0     0     0
        scsi-35000c500cacd0453                     ONLINE       0     0     0
        scsi-35000c500cacd3f6b                     ONLINE       0     0     0
        scsi-35000c500cacd0297                     ONLINE       0     0     0
      raidz2-8                                     ONLINE       0     0     0
        scsi-35000c500cacd4bcb                     ONLINE       0     0     0
        scsi-35000c500cacd36cf                     ONLINE       0     0     0
        scsi-35000c500cacd1983                     ONLINE       0     0     0
        scsi-35000c500cacd3aaf                     ONLINE       0     0     0
        scsi-35000c500cacda90b                     ONLINE       0     0     0
        scsi-35000c500cacd0d53                     ONLINE       0     0     0
        scsi-35000c500cacdaa1f                     ONLINE       0     0     0
      raidz2-9                                     ONLINE       0     0     0
        scsi-35000c500cacd3f13                     ONLINE       0     0     0
        scsi-35000c500cacd3187                     ONLINE       0     0     0
        scsi-35000c500cacd59a3                     ONLINE       0     0     0
        scsi-35000c500cacd0913                     ONLINE       0     0     0
        scsi-35000c500cacdf663                     ONLINE       0     0     0
        scsi-35000c500cacd156b                     ONLINE       0     0     0
        scsi-35000c500cacd203f                     ONLINE       0     0     0
      raidz2-10                                    ONLINE       0     0     0
        scsi-35000c500cacd4c97                     ONLINE       0     0     0
        scsi-35000c500cacd58a3                     ONLINE       0     0     0
        scsi-35000c500cacd2353                     ONLINE       0     0     0
        scsi-35000c500cacd3f67                     ONLINE       0     0     0
        scsi-35000c500cacd235f                     ONLINE       0     0     0
        scsi-35000c500cacdf14f                     ONLINE       0     0     0
        scsi-35000c500cacd2583                     ONLINE       0     0     0
      raidz2-11                                    ONLINE       0     0     0
        scsi-35000c500cacd2f87                     ONLINE       0     0     0
        scsi-35000c500cacdb557                     ONLINE       0     0     0
        scsi-35000c500cacd00f3                     ONLINE       0     0     0
        scsi-35000c500cacd3ea7                     ONLINE       0     0     0
        scsi-35000c500cacd23ff                     ONLINE       0     0     0
        scsi-35000c500cacd09d3                     ONLINE       0     0     0
        scsi-35000c500cacd3adb                     ONLINE       0     0     0
    logs    
      mirror-12                                    ONLINE       0     0     0
        nvme-eui.343842304db011100025384700000001  ONLINE       0     0     0
        nvme-eui.343842304db011060025384700000001  ONLINE       0     0     0
    cache
      nvme-eui.343842304db010920025384700000001    ONLINE       0     0     0
      nvme-eui.343842304db011080025384700000001    ONLINE       0     0     0
    spares
      scsi-35000c500cacdc907                       INUSE     currently in use
      scsi-35000c500cacd299f                       INUSE     currently in use
      scsi-35000c500cacd340b                       INUSE     currently in use
      scsi-35000c500cacd60ef                       INUSE     currently in use

errors: No known data errors
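
To see where the errors concentrate, the listing above can be summed per vdev. The following is a minimal Python sketch, not part of the original setup: the function name `tally_errors` and the result keys are made up for illustration, and it assumes the column layout shown above (name, state, READ, WRITE, CKSUM). It skips the cache and spares sections and would also skip a log device that is not inside a mirror.

```python
import re

# One device row of `zpool status`: name, state, READ, WRITE, CKSUM.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")
# Top-level vdevs such as raidz2-3 or mirror-12.
TOP_LEVEL = re.compile(r"^(raidz\d*|mirror)-\d+$")
# Nested groupings such as spare-0; their counters are not leaf-disk counters.
NESTED = re.compile(r"^(spare|replacing)-\d+$")
SECTIONS = {"logs", "cache", "spares"}


def tally_errors(status_text):
    """Sum leaf-disk WRITE and CKSUM errors under each top-level vdev."""
    totals = {}
    current = None
    for line in status_text.splitlines():
        if line.strip() in SECTIONS:
            current = None  # a new section starts; forget the previous vdev
            continue
        m = ROW.match(line)
        if not m:
            continue
        name, state = m.group(1), m.group(2)
        write, cksum = int(m.group(4)), int(m.group(5))
        if TOP_LEVEL.match(name):
            current = name
            totals[current] = {"write": 0, "cksum": 0, "not_online": 0}
        elif NESTED.match(name) or current is None:
            continue  # grouping row, pool row, or a device outside any vdev
        else:
            totals[current]["write"] += write
            totals[current]["cksum"] += cksum
            if state != "ONLINE":
                totals[current]["not_online"] += 1
    return totals
```

Feeding it the saved text of `zpool status` and printing only the vdevs with a non-zero `not_online` count shows at a glance whether the failures are spread across the pool or clustered in a few vdevs.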

I have stopped the transfer, for obvious reasons, and am waiting for the resilver to finish before I replace the FAULTED and UNAVAIL drives. Should the DEGRADED drives be replaced as well? Does anyone have an idea why this might have happened, beyond the possibility of a bad batch of drives? Or do I simply need to destroy the pool and replace the drives? Either way, I expect the data will have to be copied once again.

Chris Woelkers

1 Answer

This problem was linked to one or two bad internal SAS cables within one of the two 4U JBOD cabinets. The cables in question went from the "primary" external SAS connector to the backplane. Swapping them out with two cables from the unused "secondary" external connector fixed the problem.

Chris Woelkers