Hard drive in zpool showed error, but later seemed ok. How do I tell if something is wrong?

Question

My work computer has four hard drives set up in a zpool on an Ubuntu system. I'm trained as a programmer, not in IT, but I'm partially responsible for managing the machine. After rebooting the other day, I noticed the pool wasn't mounted, and this was the output of the zpool status command:

  pool: zhoupool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 1h48m with 0 errors on Sun Mar 12 03:12:25 2017
config:

    NAME                                 STATE     READ WRITE CKSUM
    zhoupool                             DEGRADED     0     0     0
      mirror-0                           ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GM2P  ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GMZ3  ONLINE       0     0     0
      mirror-1                           DEGRADED     0     0     0
        11645674422250617741             UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3000DM001-1ER166_Z500GP0C-part1
        ata-ST3000DM001-1ER166_Z500GVM5  ONLINE       0     0     0

errors: No known data errors

I intended to replace the hard drive; however, I noticed later that the pool had been mounted (the machine had been restarted at least once since the initial error). The zpool status output was now:

 pool: zhoupool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 31.0G in 2h10m with 0 errors on Sun May 14 02:34:46 2017

config:

    NAME                                 STATE     READ WRITE CKSUM
    zhoupool                             ONLINE       0     0     0
      mirror-0                           ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GM2P  ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GMZ3  ONLINE       0     0     0
      mirror-1                           ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GP0C  ONLINE       0     0  258K
        ata-ST3000DM001-1ER166_Z500GVM5  ONLINE       0     0     0

errors: No known data errors

This still indicated an error, so I kept working on ordering a new hard drive to replace it. However, I now notice that zpool status no longer indicates any errors:

  pool: zhoupool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 2h11m with 0 errors on Sun Jul  9 02:35:48 2017
config:

    NAME                                 STATE     READ WRITE CKSUM
    zhoupool                             ONLINE       0     0     0
      mirror-0                           ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GM2P  ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GMZ3  ONLINE       0     0     0
      mirror-1                           ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GP0C  ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GVM5  ONLINE       0     0     0

errors: No known data errors

So should I still be concerned? Was there actually a hard drive failure, or was it some software hiccup that caused the errors? How do I diagnose this?

score 1 · Accepted Answer · answered Aug 11 '17 at 18:13

Your data should be safe. It looks like the scrub on May 14 (the one that repaired 31.0G) cleaned things up, and the later scrub on July 9 ran clean. Check dmesg to see whether that device is logging timeouts or I/O errors.
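If you would rather not eyeball the status table after every scrub, a small script can flag any row that is not ONLINE or that has non-zero READ/WRITE/CKSUM counters. This is only a rough sketch: the function names and the regular expression are my own, the sample text is trimmed from the second output in the question, and I am assuming the abbreviated counters (such as 258K) use binary multiples, so treat the expanded numbers as approximate.

```python
import re

# Assumption: zpool abbreviates large counters with binary multiples.
_SUFFIX = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

_NUM = r"\d+(?:\.\d+)?[KMGT]?"
# One config row: NAME STATE READ WRITE CKSUM (anything after that is ignored).
_ROW = re.compile(
    r"^\s+(?P<name>\S+)\s+(?P<state>[A-Z]+)\s+"
    rf"(?P<read>{_NUM})\s+(?P<write>{_NUM})\s+(?P<cksum>{_NUM})"
)


def _count(text):
    """Expand a counter such as '258K' into an approximate integer."""
    if text[-1] in _SUFFIX:
        return int(float(text[:-1]) * _SUFFIX[text[-1]])
    return int(text)


def suspect_devices(status_text):
    """Return (name, state, counters) for rows that look unhealthy."""
    suspects = []
    for line in status_text.splitlines():
        m = _ROW.match(line)
        if not m:
            continue  # header, status text, blank line, etc.
        counts = {k: _count(m.group(k)) for k in ("read", "write", "cksum")}
        if m.group("state") != "ONLINE" or any(counts.values()):
            suspects.append((m.group("name"), m.group("state"), counts))
    return suspects


# Trimmed from the second 'zpool status' output in the question.
SAMPLE = """
    NAME                                 STATE     READ WRITE CKSUM
    zhoupool                             ONLINE       0     0     0
      mirror-1                           ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z500GP0C  ONLINE       0     0  258K
        ata-ST3000DM001-1ER166_Z500GVM5  ONLINE       0     0     0
"""

if __name__ == "__main__":
    for name, state, counts in suspect_devices(SAMPLE):
        print(name, state, counts)
```

In practice you would feed it the text of a real zpool status run instead of SAMPLE. A drive that keeps reappearing in the result after you have cleared the counters is the one to replace.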

You should be using smartmontools to collect SMART data from the drives, check their health status, and run occasional self-tests. Here is a decent write-up: https://www.howtoforge.com/checking-hard-disk-sanity-with-smartmontools-debian-ubuntu. Chances are that this won't be the last time that drive acts up.
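As a starting point for scripting those checks, here is a minimal sketch that runs smartctl against one drive and decodes its exit status, which is a bitmask described in the smartctl(8) man page. The helper names are made up for illustration, the messages are my paraphrase of the man page, and the script assumes smartmontools is installed and that you run it as root.

```python
import subprocess

# Meaning of bits 0-7 of smartctl's exit status (paraphrased from smartctl(8)).
EXIT_BITS = [
    "command line did not parse",
    "device open failed or device did not respond",
    "a SMART or ATA command failed, or a checksum error was found",
    "SMART status check returned DISK FAILING",
    "prefail attributes are at or below threshold",
    "attributes were at or below threshold at some time in the past",
    "device error log contains records of errors",
    "self-test log contains records of errors",
]


def decode_exit_status(status):
    """Translate smartctl's exit status into a list of readable messages."""
    return [msg for bit, msg in enumerate(EXIT_BITS) if status & (1 << bit)]


def check_drive(device):
    """Run 'smartctl -H -A' on a device; return (problems, raw report)."""
    # -H prints the overall health verdict, -A the vendor attribute table.
    result = subprocess.run(
        ["smartctl", "-H", "-A", device],
        capture_output=True,
        text=True,
    )
    return decode_exit_status(result.returncode), result.stdout
```

For the drive that misbehaved, you would call check_drive("/dev/disk/by-id/ata-ST3000DM001-1ER166_Z500GP0C") and read the report whenever the returned list is not empty. In the attribute table, growing reallocated or pending sector counts are the usual sign that a replacement is due.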
