
We have a server with a PERC H740P Mini (embedded) controller, a 2-disk RAID 1 with ext4 for the OS (CentOS 7.8), and a 6-disk raidz2 ZFS on Linux setup for the data, all on the same controller.

It's generally considered bad® to run ZFS with HW RAID, but this controller doesn't seem to support a mixed RAID/non-RAID setup, so the 6 data drives (for ZFS) are all single disk RAID 0.

We see occasional ZFS panics that I suspect are due to the RAID controller interfering. Where can I read about the exact semantics of single disk RAID 0 for this controller so that I might be able to determine if it is the cause?

Are there any perccli64 incantations or other debuggery I could use to see what the controller might have been doing when ZFS pooped the proverbial bed?
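
To be concrete, the only incantations I can think of are generic storcli-style inspection commands along these lines (assuming perccli64 really is a rebadged storcli, and with /c0 standing in for whatever your controller index is); I don't know which of them, if any, would actually reveal what the controller was doing around the time of a panic:

```
# controller-wide state: firmware, cache, BBU and mode/personality
perccli64 /c0 show all

# per-virtual-disk settings (the single-disk RAID 0 VDs live here)
perccli64 /c0/vall show all

# per-physical-disk state and error counters
perccli64 /c0/eall/sall show all

# controller event log and firmware terminal log
perccli64 /c0 show events file=/tmp/perc-events.txt
perccli64 /c0 show termlog > /tmp/perc-termlog.txt
```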

sed_and_done
  • *It's generally considered bad® to run ZFS with HW RAID* LOL. As someone who was working closely with Sun systems and storage when ZFS came out, this continuation of over-the-top "ZFS izz da bestest!!!!" zealotry that began back in 2005 or so still amuses me. ZFS is a file system that has some great features. It's also just about the slowest file system out there and a giant memory hog. If ext4 were to panic on you, would you blame the hardware? Your boot drive on the same controller isn't pooping its bed on you, is it? That controller is just presenting a set of SCSI LUNs to the OS. – Andrew Henle Mar 21 '21 at 16:22
  • Try upgrading your PERC H740P firmware. There's this new [Enhanced HBA Mode](https://www.dell.com/support/manuals/en-us/poweredge-rc-h740p/perc10_ug_pub/enhanced-hba-mode?guid=guid-1c2cb930-a1d3-49dd-966a-eff7696812b7&lang=en-us) that's supposed to allow you to present non-RAID disks to the host. – mforsetti Mar 21 '21 at 16:43
  • @AndrewHenle It isn't zealotry in my case, I think the software is definitely faulty for panicking in this scenario. I am just speaking of the notion that ZFS expects to communicate with disks at a low level, and that RAID interactions obscure some of that. Do you disagree that running ZFS on Linux above HW RAID is less good than a pass-thru mode or non-RAID controller? – sed_and_done Mar 21 '21 at 17:25
  • @mforsetti eHBA doesn't allow mixed RAID and non-RAID on the same controller it seems, which is why we used single-disk RAID-0 instead of eHBA – sed_and_done Mar 21 '21 at 17:25
  • @sed_and_done The "ZFS needs to communicate with the disks at a low level" is pretty much FUD that originated when Sun Microsystems was trying to break into the storage market with [Sun Open Storage](https://en.wikipedia.org/wiki/Sun_Open_Storage). Sun was dying and desperate for new markets. They had developed ZFS and were trying to figure out a way to leverage it into a revenue stream - some of the systems were ["Thumper" and "Thor"](https://en.wikipedia.org/wiki/Sun_Fire_X4500). Somehow the mindset of "ZFS won't work if you put it on hardware RAID" has stuck in the zealot community. – Andrew Henle Mar 21 '21 at 17:34
  • @AndrewHenle OK, but this same zealotry is being perpetuated by the upstream devs, so I think it's worth taking seriously https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Hardware.html#hardware-raid-controllers – sed_and_done Mar 21 '21 at 17:40
  • @sed_and_done That whole page assumes hardware RAID controllers are stupid, slow devices that only do simple pass-through, and that the person setting up the RAID array on the controller doesn't properly align the LUNs and/or partitions. It totally ignores the fact that hardware RAID means you pass less data over the PCI bus to do IO operations. ZFS mirrors use twice the PCI bandwidth. And it's funny how they say the way Linux can't present devices with consistent paths is a problem with hardware RAID controllers and not the OS. But at least they finally lost the bit-rot FUD. – Andrew Henle Mar 21 '21 at 19:20
  • @sed_and_done echoing what [this answer](https://serverfault.com/a/1057793) said, you should be able to expose both VDs and non-RAID disks to your host with eHBA, though given your configuration, you may need to remove any old configurations first. You may want to back up your data twice first, and better still if you have a matching test environment. – mforsetti Mar 22 '21 at 06:39
  • @AndrewHenle I saw a case of metadata corruption on ZFS yesterday due to hardware RAID; the MegaRAID was in "HBA" mode, but the FreeBSD driver was mfi, so it was using controller RAM. Power outage, no BBU, entire pool lost. Not my pool, thankfully. – Vinícius Ferrão Jul 19 '21 at 19:04
  • I now think this is a use-after-free in ZFS; it probably has little to nothing to do with the HW unless some HW-accelerated compression or vectoring using specialized CPU instructions comes into play. I think it might be related to PostgreSQL fsync behaviour or other PostgreSQL-specific IO on the ZIO_WRITE_COMPRESS write-path pipeline. Something is freeing a buffer that's being borrowed for a linear abd. Upstream is no help, so I am on my own debugging a crash dump, with no repro, on a foreign and very complex code base (OpenZFS). – sed_and_done Jul 19 '21 at 19:15
  • @ViníciusFerrão First, blaming a hardware RAID controller being run without BBU is like blaming the bowling ball you dropped on your foot for your broken toe. Second, RAID is not a backup, so restore from backup because any power surge can kill an entire data system at any time. The fact that the system managed to survive well enough so you could see your fried data is a bonus. – Andrew Henle Jul 19 '21 at 19:15
  • @AndrewHenle I didn't blame the controller itself, but those kinds of failures are more prone to happen when you have ZFS on top of HW RAID without a BBU. ZFS is extremely hard to recover in comparison with other filesystems, and you know this. The issue with HW RAID is that vendors in general push it as the standard, while the BBU is always optional instead of being shipped by default. I didn't say anywhere in my message that RAID is a backup. You don't see other SDS solutions like GlusterFS and CEPH pushing towards hardware RAID. Finally, BSD has historically had issues with the mfi/mr_sas driver. – Vinícius Ferrão Jul 19 '21 at 19:21

1 Answer


I think it is unlikely that the ZFS panics you are experiencing have anything to do with your hardware RAID controller. You should provide the exact panic / dmesg output to let us understand what is going on.

That said, a single-disk RAID 0 virtual disk is different from a non-RAID disk because:

  • the controller writes its own RAID metadata to the disk, even for a single-disk RAID 0
  • the controller write-back cache is enabled for RAID 0 virtual disks, while it is disabled for non-RAID disks (see the sketch after this list for checking and changing this)
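
As a minimal sketch of what I mean by the second point (assuming perccli64 follows the usual storcli syntax, and using /c0 and v2 as placeholders for your actual controller and virtual disk numbers), you can inspect the per-VD cache policy and, if desired, force the RAID 0 VDs to behave more like plain disks:

```
# show the cache / IO policy of every virtual disk
# (look for WB vs WT, RA vs NoRA, Cached vs Direct)
perccli64 /c0/vall show all

# force write-through, no read-ahead, direct IO, and disable the
# drive's own cache on one ZFS VD (repeat for each RAID 0 VD)
perccli64 /c0/v2 set wrcache=wt
perccli64 /c0/v2 set rdcache=nora
perccli64 /c0/v2 set iopolicy=direct
perccli64 /c0/v2 set pdcache=off
```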

In any case, your controller supports eHBA mode which, in turn, should present unconfigured disks as non-RAID disks to the OS. From the docs, it seems that eHBA mode can be used concurrently for RAID 0/1/10 arrays and non-RAID disks.

Try passing the ZFS disks as non-RAID drives and please report back.
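
A possible sequence for doing that, assuming your perccli64 build accepts the storcli-style personality commands (verify the exact syntax against Dell's documentation, and make sure the data is backed up before touching the VDs):

```
# check the current controller mode/personality (RAID vs eHBA)
perccli64 /c0 show personality

# once the data is safely elsewhere, delete the single-disk RAID 0 VDs
# (v2 is a placeholder; repeat for each ZFS VD, leave the RAID 1 boot VD alone)
perccli64 /c0/v2 del

# switch the controller to eHBA mode; if perccli64 refuses, the same change
# can be made from the controller BIOS (HII) setup utility or iDRAC
perccli64 /c0 set personality=eHBA

# after a reboot, the unconfigured disks should show up as non-RAID devices
perccli64 /c0 show
lsblk
```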

shodanshok
  • I am less confident that the controller is related now, but wondering if the logbias=throughput ZFS setting is somehow doing a transactional log (ZIL) write in ZFS, such that the write is considered done/safe when perhaps it's actually only in the RAID controller's cache, etc. Ultimately the problem we are seeing is that the zio_t->io_abd->abd->abd_u's linear buffer ( abd_buf ) is getting freed prematurely, so we end up dereferencing an invalid pointer in the kernel while doing a write in the issue txg ( z_wr_iss ). It's a mess and the crash dump is hard to decipher. – sed_and_done Jul 10 '21 at 19:36
  • Any write is *always* transactional in ZFS, whether using `logbias=throughput` or `logbias=latency`. Again, I suggest sharing your crash dump on the ZFS mailing list. – shodanshok Jul 11 '21 at 11:31
  • I shared what I safely can in the OpenZFS GitHub issues. The zio.io_abd in the panicked thread has the same abd->abd_u.abd_linear.abd_buf as another process's zio, so I suspect there is a locking issue allowing the same buffer to get used by two separate zios. I wish I understood the ZFS concepts better; I can't formulate theories effectively without knowing how it should be working, which requires a lot of code reading. – sed_and_done Jul 12 '21 at 16:46