
zpool status is reporting faulted drives, but they actually appear to be okay. Is it possible to add them back?

$ zpool status -v
  pool: darkpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub in progress since Fri Nov  8 04:52:09 2019
    1004G scanned out of 47.5T at 81.4M/s, 166h22m to go
    0B repaired, 2.06% done
config:

    NAME                          STATE     READ WRITE CKSUM
    darkpool                      DEGRADED     0     0     0
      raidz3-0                    DEGRADED     0     0     0
        wwn-0x5000c5008581aafb    ONLINE       0     0     0
        wwn-0x5000c5008581b61b    ONLINE       0     0     0
        783034318520267027        FAULTED      0     0     0  was /dev/sdm1
        7369503050985789936       FAULTED      0     0     0  was /dev/sdj1
        wwn-0x5000c5008581b953    ONLINE       0     0     0
        wwn-0x5000c5008581bdf7    ONLINE       0     0     0
        wwn-0x5000c50085825ec7    ONLINE       0     0     0
        11744243917579175290      FAULTED      0     0     0  was /dev/sdg1
        wwn-0x5000c5008581e423    ONLINE       0     0     0
        wwn-0x5000c5008581fd3f    ONLINE       0     0     0
        wwn-0x5000c50085820b93    ONLINE       0     0     0
        wwn-0x5000c500858211b3    ONLINE       0     0     0
        wwn-0x5000cca267ab0de4    ONLINE       0     0     0
        spare-13                  DEGRADED     0     0     0
          11992420879588183985    FAULTED      0     0     0  was /dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:10:0-part1
          wwn-0x5000c500858252ef  ONLINE       0     0     0
    spares
      wwn-0x5000c500858252ef      INUSE     currently in use

Faulted Drives Seem Fine

$ sudo smartctl --all /dev/sdm1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-66-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST8000NM0075
Revision:             PS24
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c50085820b93
Serial number:        ZA12CVG1
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Nov  8 10:26:20 2019 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     58 C
Drive Trip Temperature:        60 C

Manufactured in week 23 of year 2016
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  148
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1344
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 2633993520
  Blocks received from initiator = 313335416
  Blocks read from cache and sent to initiator = 3189766298
  Number of read and write commands whose size <= segment size = 373006550
  Number of read and write commands whose size > segment size = 142985

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 28987.73
  number of minutes until next internal SMART test = 48

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   574211145      105         0  574211250        105     242574.514           0
write:         0        0        17        17         17      18073.098           0
verify:   252916        0         0    252916          0          0.526           0

Non-medium error count:     1269

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  96       4                 - [-   -    -]
# 2  Reserved(7)       Completed                  64       4                 - [-   -    -]

Long (extended) Self Test duration: 47220 seconds [787.0 minutes]

$ sudo smartctl --all /dev/sdj1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-66-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST8000NM0075
Revision:             PS24
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c50085823d2b
Serial number:        ZA12BNXA
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Nov  8 10:26:24 2019 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     47 C
Drive Trip Temperature:        60 C

Manufactured in week 23 of year 2016
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  148
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1364
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 4179446744
  Blocks received from initiator = 2703674280
  Blocks read from cache and sent to initiator = 2799660441
  Number of read and write commands whose size <= segment size = 334518430
  Number of read and write commands whose size > segment size = 131599

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 28987.73
  number of minutes until next internal SMART test = 43

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   4216128253        9         0  4216128262          9     214344.135           0
write:         0        0         4         4          4      17073.614           0
verify:   269974        0         0    269974          0          0.562           0

Non-medium error count:      570

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  96       4                 - [-   -    -]
# 2  Reserved(7)       Completed                  64       4                 - [-   -    -]

Long (extended) Self Test duration: 47220 seconds [787.0 minutes]

$ sudo smartctl --all /dev/sdg1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-66-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST8000NM0075
Revision:             PS24
Compliance:           SPC-4
User Capacity:        8,001,563,222,016 bytes [8.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5008581aafb
Serial number:        ZA12CXW2
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Nov  8 10:26:28 2019 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     59 C
Drive Trip Temperature:        60 C

Manufactured in week 23 of year 2016
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  148
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1334
Elements in grown defect list: 0

Vendor (Seagate) cache information
  Blocks sent to initiator = 2845390680
  Blocks received from initiator = 1453787448
  Blocks read from cache and sent to initiator = 3178782010
  Number of read and write commands whose size <= segment size = 376760133
  Number of read and write commands whose size > segment size = 148599

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 28987.77
  number of minutes until next internal SMART test = 39

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   704945336        2         0  704945338          2     244917.683           0
write:         0        0        73        73         73      18665.495           0
verify:   320880        0         0    320880          0          0.667           0

Non-medium error count:     1242

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  96       4                 - [-   -    -]
# 2  Reserved(7)       Completed                  64       4                 - [-   -    -]

Long (extended) Self Test duration: 47220 seconds [787.0 minutes]

They're all still present

Current device names and WWNs:

sda     wwn-0x5000c500858211b3  
sdb     wwn-0x5000c5008581b953  
sdc     wwn-0x5000c50085825ec7  
sdd     wwn-0x5000c5008581e423  
sdf     wwn-0x5000c5008581b61b  
sdg     wwn-0x5000c5008581aafb  *
sdh     wwn-0x5000c5008581cc03  *
sdi     wwn-0x5000cca267ab0de4      
sdk     wwn-0x5000c5008581b933  *
sdl     wwn-0x5000c5008581bdf7  *
sdm     wwn-0x5000c50085820b93  *
sdn     wwn-0x5000c5008581b79f  *
sdo     wwn-0x5000c500858252ef  *
sdp     wwn-0x5000c5008581fd3f  
sdq     wwn-0x61866da05f3bc2001f1c1a0d117e72cf
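
A listing like the one above can be regenerated from the symlinks under /dev/disk/by-id. The following is a minimal sketch, not necessarily how the list above was produced; the function name list_wwn_map is hypothetical. Kernel names such as sdg can be reassigned between boots, so a "was /dev/sdg1" note in zpool status need not refer to the disk that currently holds that name.

```shell
# Print "sdX<TAB>wwn-..." for every whole-disk WWN symlink in a by-id directory.
# The directory defaults to /dev/disk/by-id; partition links (-partN) are skipped.
# list_wwn_map is a hypothetical helper name used only for this sketch.
list_wwn_map() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/wwn-*; do
        [ -L "$link" ] || continue                  # glob matched nothing, or not a symlink
        case "$link" in *-part[0-9]*) continue ;; esac
        target=$(readlink -f "$link")               # resolves to e.g. /dev/sdg
        printf '%s\t%s\n' "${target##*/}" "${link##*/}"
    done | sort
}
```

Running list_wwn_map with no arguments reads the live system; comparing its WWN column against the NAME column of zpool status shows which attached disks the pool is not currently using.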
Louis Waweru
  • I feel like we're missing information on what happened *right* before this... – ewwhite Nov 08 '19 at 15:42
  • @ewwhite This was the last thing to happen to the pool: https://serverfault.com/questions/985409/zpool-replace-ran-successfully-but-still-recommends-zpool-replace-what-is-it-t – Louis Waweru Nov 08 '19 at 15:45

1 Answer


What's in the kernel ring buffer? Can you post relevant snippets of dmesg -T?
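
To narrow the dmesg output down before posting it, something like the following filter may help. It is only a sketch: the function name filter_disk_events is hypothetical, and the pattern list (including the mpt2sas/mpt3sas driver names) is an assumption about what is likely to be relevant, not a complete set.

```shell
# Keep only kernel log lines that commonly accompany disk or HBA trouble.
# Reads log text on stdin, e.g. the output of `dmesg -T`.
# filter_disk_events and its pattern list are assumptions made for this sketch.
filter_disk_events() {
    grep -Ei 'I/O error|timeout|timed out|timing out|reset|offline|\[sd[a-z]+\]|mpt[23]sas' || true
}
```

Typical use would be dmesg -T | filter_disk_events; the trailing "|| true" just keeps the function from returning an error when nothing matches.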

Try a zpool clear to reset any transient errors.
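
As a rough sketch of that step: the wrapper below runs zpool clear on a pool (or on a single device in it) and then prints the status again. The function name clear_pool_errors and the DRY_RUN switch are hypothetical conveniences; zpool clear and zpool status -v are the only real commands involved.

```shell
# Clear error counters on a pool (optionally on one device), then show status.
# With DRY_RUN=1 the commands are printed instead of executed.
# clear_pool_errors and DRY_RUN are hypothetical names used for this sketch.
clear_pool_errors() {
    pool="$1"; device="${2:-}"
    run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "+ $*"; else "$@"; fi; }
    if [ -n "$device" ]; then
        run zpool clear "$pool" "$device"
    else
        run zpool clear "$pool"
    fi
    run zpool status -v "$pool"
}
```

For the pool in the question this would be clear_pool_errors darkpool, run as root; setting DRY_RUN=1 first shows what would be executed without touching the pool.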

Are these all SAS disks? Or do you have SATA mixed into this environment?


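If any SATA drives are involved, raise their device timeouts. The Linux default is 30 seconds, and a drive that takes longer than that to recover from an error can be reset by the kernel and then dropped from the pool.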

echo 180 > /sys/block/sdX/device/timeout where sdX is the device.

Then run a zpool clear and see if things resilver properly.
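
To apply the timeout to several disks at once, a loop over sysfs is one option. This is a sketch under assumptions: the function name set_disk_timeouts is hypothetical, and it touches every sd* device rather than only SATA ones, which is a simplification. The setting lives in sysfs, so it does not survive a reboot.

```shell
# Write a command timeout (in seconds) to every sd* disk under a sysfs block dir.
# The directory defaults to /sys/block; on a real system this must run as root.
# set_disk_timeouts is a hypothetical helper name used only for this sketch.
set_disk_timeouts() {
    seconds="$1"; sysblock="${2:-/sys/block}"
    for f in "$sysblock"/sd*/device/timeout; do
        [ -w "$f" ] || continue                 # skip missing or read-only entries
        echo "$seconds" > "$f"
        printf '%s -> %s\n' "${f#"$sysblock"/}" "$(cat "$f")"   # report the value read back
    done
}
```

Calling set_disk_timeouts 180 matches the value suggested above; the function prints each file it changed together with the value read back from it.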

ewwhite