
Our storage server has some problems. A short while ago a disk failed (WD 4TB RE SAS), although the RAID controller (LSI MegaRAID 9271-8i) kept the disk online (status: OK); only the media error counter showed 1 error. We decided to play it safe and replace the disk. During the resilver, a second and a third disk were marked as (resilvering), even though only one disk showed a single read error. Today the resilver completed (no corruption, everything fine). I started a scrub and was met with this:

zpool status
  pool: data
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub in progress since Thu Jan 14 10:50:00 2016
    2.71T scanned out of 111T at 718M/s, 43h59m to go
    0 repaired, 2.44% done
config:

        NAME                     STATE     READ WRITE CKSUM
        data                     DEGRADED     0     0     0
          raidz2-0               DEGRADED     0     0     0
            br0c2                ONLINE       0     0     0
            br1c2                ONLINE       0     0     0
            br2c2                ONLINE       0     0     0
            br0c3                ONLINE       0     0     0
            br1c3                ONLINE       0     0     0
            br2c3                ONLINE       0     0     0
            r2c1                 ONLINE       0     0     0
            r1c2                 ONLINE       0     0     0
            r5c3                 ONLINE       0     0     0
            sdb                  ONLINE       0     0     0
            sdc                  ONLINE       0     0     0
            7196084230607724634  FAULTED      0     0     0  was /dev/sdai1
            r5c0                 ONLINE       0     0     0
            r0c1                 ONLINE       0     0     0
            r1c1                 ONLINE       0     0     0
            r3c1                 ONLINE       0     0     0
            r4c1                 ONLINE       0     0     0
          raidz2-1               ONLINE       0     0     0
            r5c1                 ONLINE       0     0     0
            r0c2                 ONLINE       0     0     0
            r2c2                 ONLINE       0     0     0
            r3c2                 ONLINE       0     0     0
            r4c2                 ONLINE       0     0     0
            r5c2                 ONLINE       0     0     0
            r0c3                 ONLINE       0     0     0
            r1c3                 ONLINE       0     0     0
            r2c3                 ONLINE       0     0     0
            r3c3                 ONLINE       0     0     0
            r4c3                 ONLINE       0     0     0
            br0c0                ONLINE       0     0     0
            br1c0                ONLINE       0     0     0
            br2c0                ONLINE       0     0     0
            br0c1                ONLINE       0     0     0
            br1c1                ONLINE       0     0     0
            br2c1                ONLINE       0     0     0

errors: No known data errors

/dev/sdai1 is online and the RAID controller is not showing any error (not even a media error). Can I try taking the disk offline and bringing it back online?

Update: I tried to detach the disk, but it refuses to do so. I was under the impression that raidz2-0 has 2 parity disks (and raidz2-1 as well), so why can't I detach?

zpool detach data 7196084230607724634
cannot detach 7196084230607724634: only applicable to mirror and replacing vdevs
SvennD
  • This looks like a complete mess! Are you using multiple RAID0 arrays comprised of a single disk through your RAID controller, then presenting them to ZFS?!? – ewwhite Jan 14 '16 at 12:38
  • You already told me; as a noob I followed [this guide](https://calomel.org/megacli_lsi_commands.html). We are currently in the process of making a backup and recreating everything. So far ZFS has been nothing but problems. – SvennD Jan 14 '16 at 12:46
  • Recreating really is the best thing you can do! Be sure to follow some guides when re-creating. Only put 6 or 8 disks in a raidz2 and just create multiple raidz2 vdevs. Also, if you can, get rid of the RAID controller. The more ZFS knows about the hardware, the better it will work. I run 96 disks with ZFS (Solaris, not Linux) and haven't had a single problem (other than some faulted disks, but that's normal) in more than 5 years! Some things to read: https://docs.oracle.com/cd/E23823_01/html/819-5461/zfspools-4.html – embedded Jan 14 '16 at 13:18
  • The thing that got me to this point was a guide, and a pretty well written guide at that. (6*4)-8 is just not economically feasible at this point; that would be over 10 disks lost to parity ... – SvennD Jan 14 '16 at 13:24
  • Then go with 8 disks per raidz2 and get 75% space efficiency. Economically feasible is whatever never causes problems. You're "wasting" some disks, but how much does it cost if you have to spend hours fixing bad configurations? Use raidz and never "normal" RAID, and never deal with it again: no write holes, no unrecoverable read errors, no silent data corruption. (A minimal layout sketch follows these comments.) – embedded Jan 14 '16 at 13:38
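For reference, a minimal sketch of the layout suggested above: one pool built from several 8-disk raidz2 vdevs. The pool name and the diskNN names are placeholders, not the poster's actual hardware; in practice you would use stable /dev/disk/by-id/... paths.

# Hypothetical example: two 8-disk raidz2 vdevs in one pool.
# Each diskNN stands for a persistent device path such as /dev/disk/by-id/...
zpool create data \
  raidz2 disk01 disk02 disk03 disk04 disk05 disk06 disk07 disk08 \
  raidz2 disk09 disk10 disk11 disk12 disk13 disk14 disk15 disk16

Each vdev loses 2 of its 8 disks to parity, which is the 75% space efficiency mentioned above.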

1 Answer


Why are you passing the disks through a RAID controller? JBOD would make more sense when using ZFS. You could run into problems because of your controller.

Anyway, it's safe to just detach and re-attach the disk. You could also try replacing the disk with itself (without physically swapping it: zpool replace pool disk).

Let it resilver and scrub again.
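A rough sequence, using the pool name and device GUID from the question (whether the GUID or the /dev/sdai1 path is accepted depends on how the device currently shows up, so treat this as a sketch rather than a recipe):

# Try bringing the faulted member back online first:
zpool online data 7196084230607724634

# If the label really is gone, replace the member with itself
# (no spare disk needed):
zpool replace data 7196084230607724634 /dev/sdai1

# Watch the resilver, then scrub once it has finished:
zpool status -v data
zpool scrub data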

embedded
  • There is no JBOD option on this RAID controller; a single-disk RAID0 per drive was the closest thing to JBOD... – SvennD Jan 14 '16 at 12:38
  • *Why are you passing the disks through a RAID Controller?* The disk in question is `/dev/sdai1`. There are 17 drives in each `raidz2` array for a total of 34 drives in the pool. How would you recommend attaching that many drives? – Andrew Henle Jan 14 '16 at 12:38
  • 2 × 24-disk enclosures, SAS-connected to the storage host with 2 SIMs so you can multipath. – embedded Jan 14 '16 at 12:48
  • I was wrong; it was not the same device after all. Panic is a bad teacher. Linux did name it /dev/sdai, but that was after a reboot, so the names got shuffled. I accepted your answer as the last sentence is on the mark: resilver and scrub. The disk, however, does not seem broken. But ZFS is smarter than a RAID controller. – SvennD Jan 20 '16 at 20:03
  • @SvennDhert Use /dev/disk/by-id or /dev/disk/by-vdev for import if you want names to remain the same across reboots. Configure them to point to persistent identifiers, not directly to the /dev/sdxyz device nodes. With that many disks, detection order is a coin-toss. – user Mar 07 '16 at 21:33
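A minimal sketch of such a re-import, assuming the pool name data from the question and that the pool can briefly be taken offline:

# Export the pool, then re-import it using persistent device names:
zpool export data
zpool import -d /dev/disk/by-id data

After that, zpool status lists the members by their stable by-id names instead of the reboot-dependent sdXY nodes.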