10

This has been asked a million times, but every time the answer comes with an "it depends on your requirements" caveat, so I cannot extract a general guideline to apply to my case. So I ask again.

I have a 24-bay disk server (dual Xeon Silver 4210R with 128 GB RAM and CentOS 7) and 16 TB disks to store scientific data, organized in large files (~GB in size) which are typically written once and then processed many times (the output of this processing does not matter for what follows). The data is mission critical but to some extent recoverable from other storage sites, so a failure with data loss would be a big issue but probably not a killer. Available disk space should be maxed out within the previous constraints. To summarize, in order of decreasing importance my constraints are:

  1. data integrity
  2. read performance
  3. available disk space

My tentative solution is to use hardware RAID 60 with two RAID 6 arrays of 12 disks each, and the ZFS filesystem on top. In my limited understanding, RAID 60 should provide a more reliable and read-performant solution than RAID 6 with a reasonable loss in terms of available space, and ZFS is a good choice for a fault-tolerant filesystem. I have no clue about the possible downsides of this configuration (e.g. array rebuild time? A different filesystem?) nor about possible better alternatives, so I'd like to hear some informed opinions.

Thanks in advance for any suggestion.

Nicola Mori
  • This is a great candidate for ZFS. Do you have the details of the specific hardware in use? Vendor/make/models. – ewwhite Jan 15 '21 at 12:28
  • @ewwhite It's a Supermicro 6049P-E1CR24H with 24 Seagate ST16000NM002G, dual Xeon Silver 4210R and 128 GB RAM. – Nicola Mori Jan 15 '21 at 13:36
  • The hardware solution should work. If you're really concerned with details, book a few hours with a ZFS consultant to help guide you through the process or design. It's better to have that type of design reviewed or signed-off on by an expert. – ewwhite Jan 16 '21 at 08:47
  • If you pick ZFS, I would suggest benchmarking with the last few LTS kernel versions, as your hardware has many features not supported in the GA kernel of CentOS 7, which went EOL 3 years ago. You may also be able to use compression to improve disk performance at the expense of CPU, if you have enough bandwidth to the server to make use of it. What sort of bandwidth do you have to the server? – Richie Frame Jan 17 '21 at 09:55
  • @RichieFrame The server has a dual 10 Gbps connection and a dual Xeon Silver 4210R (20 cores, 40 threads). Is it worth trying compression? Thanks also for the kernel tip. – Nicola Mori Jan 17 '21 at 10:12
  • @RichieFrame That's an opportunity for Debian. Go with stable (buster as of now) but install `zfs-dkms` from buster-backports. – iBug Jan 17 '21 at 14:07
  • @RichieFrame I thought about it but I have to integrate the disk server into a CentOS 7 computing farm. The IT guy won't be happy to see OS fragmentation. – Nicola Mori Jan 18 '21 at 08:17
  • you picking this up from thinkmate.com? – warren Jan 18 '21 at 23:11
  • @warren No, I bought it from an Italian company selling Supermicro products on the national market. – Nicola Mori Jan 20 '21 at 07:27
  • Does your RAID controller support HBA mode? With hardware RAID, you won't get the benefit of ZFS self-healing capabilities: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Hardware.html#hardware-raid-controllers – Strepsils Jan 20 '21 at 07:41
  • @Strepsils I don't think so. This is my controller: https://www.supermicro.com/en/products/accessories/addon/AOC-S3108L-H8iR.php. As far as I see it does not support HBA mode. – Nicola Mori Jan 20 '21 at 07:48
  • @NicolaMori It seems you are right. From what I'm reading on Supermicro, it does not have the ability for IT mode and can't be flashed into it. However, I believe, it can be used as a JBOD but I haven't found specific info on this. – Strepsils Jan 20 '21 at 08:00
  • @NicolaMori It appears to be an LSI 3108-based HBA. See https://www.supermicro.com/support/faqs/faq.cfm?faq=28166 – Andrew Henle Jan 20 '21 at 12:07
  • @AndrewHenle Thanks, I found a similar hint on Reddit. Tomorrow I'm going to physically put my hands on the server again, to play a bit with the controller BIOS options and try to set the JBOD mode. – Nicola Mori Jan 21 '21 at 09:09

4 Answers

11

ZFS doesn't like to sit on top of hardware RAID. You would be better off giving ZFS the raw disks and letting it handle the redundancy itself, configured as RAIDZ2 vdevs (the ZFS equivalent of a RAID 60 layout). It's also a good idea to keep a replacement drive nearby, or even leave one or more hot spares in the rack.

See the performance benchmarks here: https://calomel.org/zfs_raid_speed_capacity.html
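
For illustration, a pool built directly on the raw disks with two 12-wide RAIDZ2 vdevs might be created roughly as follows. This is a sketch only: the pool name `tank` and the short device names are placeholders, and in practice you would use stable `/dev/disk/by-id/` paths.

```
# Sketch: two 12-disk RAIDZ2 vdevs, roughly the ZFS analogue of RAID 60.
# ashift=12 assumes 4K-sector drives.
zpool create -o ashift=12 tank \
    raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl \
    raidz2 sdm sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx

# If a bay (or an external slot) is reserved for a hot spare:
# zpool add tank spare sdy
```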

Nikita Kipriyanov
8

For such a big setup (384 TB raw space) I strongly suggest using ZFS, as its data integrity (and repair) guarantees are simply too valuable to ignore.

If for "read performance" you mean sequential read speed, I would use a ZFS RAIDZ2 array configured with 2x 12-wide vdevs. Moreover, a large recordsize and lz4 compression should be two good choices. If going down that route, please keep in mind that it is generally better to avoid hardware RAID when using ZFS.

If you need high random read performance (unlikely, based on your description) you need to use smaller ZFS RAIDZ2 vdevs or even mirrors (if losing 50% of usable space is tolerable).

The non-ZFS alternative would be to use a hardware-based RAID 60 array (having at least 2+ GB of powerloss-protected writeback cache) and a classical non-CoW filesystem (e.g. XFS). In this case, you can use lvmthin as the volume manager and snapshot layer. That said, go with ZFS if you can.
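
A minimal sketch of that alternative, assuming the controller exposes the RAID 60 array as a single virtual disk (here `/dev/sdb`; all device, VG and LV names below are placeholders):

```
# Hardware RAID 60 virtual disk -> LVM thin pool -> XFS
pvcreate /dev/sdb
vgcreate vg_data /dev/sdb
lvcreate --type thin-pool -l 95%FREE -n thinpool vg_data
lvcreate --thin -V 300T -n lv_data vg_data/thinpool   # virtual size is only an example
mkfs.xfs /dev/vg_data/lv_data
```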

shodanshok
  • Each client will read mostly sequentially, but simultaneous access from several clients effectively results in random reads, right? If true then I'd avoid over-optimizing for sequential reads. Also, I'm a noob and didn't know that ZFS dislikes hardware RAID; does ZFS+RAIDZ2 have some disadvantage against hardware RAID + a different FS (e.g. slower performance, being a software RAID)? Would it be better to use a JBOD card instead of a RAID controller to set up ZFS+RAIDZ2? – Nicola Mori Jan 15 '21 at 13:34
  • @NicolaMori it is unlikely that multiple clients reading will turn large sequential reads into totally random, small ones. Moreover, ZFS has special provisions for multi-stream prefetch reads. If your RAID controller supports pass-through operation, it should be fine with ZFS; otherwise, I suggest you use a JBOD/HBA card. That said, performant storage at this scale is hard; you should really take your time to learn and experiment. – shodanshok Jan 15 '21 at 13:42
  • I'm not after the ultimate performance, just trying to avoid trivial errors and ending up with a largely under-performing setup, since I'm not an IT expert. I'll start by studying the ZFS+ZRAID2 setup as you suggested. – Nicola Mori Jan 15 '21 at 13:52
  • It seems that my controller allows only for exposing disks as RAID. Would creating 24 single-disk RAID0 arrays do the trick for using ZFS? – Nicola Mori Jan 15 '21 at 15:17
  • I would not use single-disk RAID0 arrays to simulate individual disk drives. However, this is getting way off topic. I strongly suggest you take your time to research, learn and experiment. If needed, talk to (or engage) a professional sysadmin. – shodanshok Jan 15 '21 at 16:42
  • @NicolaMori Don't bother with hardware RAID. It's not going to give you better performance than ZRAID. I would suggest replacing the onboard RAID controller with a different one which supports HBA/IT mode, or can be flashed to support it. It can even be an old SAS controller, nothing fancy is required. – ciamej Jan 15 '21 at 21:35
  • @ciamej *Don't bother with hardware RAID. It's not going to give you better performance than ZRAID.* I hate this cargo-cult worship of ZFS. Hardware RAID can and will beat ZFS RAID performance for a lot of reasons. I've been using ZFS ***professionally*** since it first came out on Solaris - and the ***last*** word I'd ever use to describe ZFS is "fast". ZFS is great, but it's ***not*** fast. Performance-wise, the most accurate word to describe ZFS is "pig". ZFS only runs fast if you throw tons of hardware at it. And ZFS on hardware RAID is common on Solaris installations. – Andrew Henle Jan 20 '21 at 10:59
  • @AndrewHenle I didn't mean just ZFS, but software RAID and similar solutions in general. The OP is concerned with read performance and I am pretty sure hardware raid does not hold any advantages in that regard. – ciamej Jan 20 '21 at 16:13
  • @AndrewHenle each RAIDZ top-level vdev provides the same IOPS as a single component device, so yes, it is slow and should be used only if you really know it is appropriate. On the other hand, ZFS mirrors (i.e. RAID10) are quite fast and, when coupled with ARC/L2ARC, can provide better performance than hardware RAID10 (for HDD-based arrays, at least). Anyway, I agree that ZFS's focus is data safety rather than maximum speed (which CoW filesystems rarely achieve). – shodanshok Jan 20 '21 at 17:15
7

Another recommendation is to keep your OS separate from your data disks.

That Supermicro chassis has two additional slots in the rear for 2.5" SATA disks. These should be a RAID 1 pair containing the OS and any swap. The 24 disks out the front should just be for data, in whatever RAID array or ZFS setup you choose.

[Image: rear view of the Supermicro chassis, showing the two 2.5" drive slots for the OS]
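
If those rear disks are not mirrored by the installer or a dedicated controller, a software mirror is one option. A sketch only, with placeholder device names for the two rear bays:

```
# Hypothetical: mirror the two rear 2.5" disks for the OS
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdy /dev/sdz
mkfs.xfs /dev/md0    # or let the OS installer partition and format the mirror
```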

Criggie
  • I know it's not really an answer, but I can't put a picture in a comment. Also, this does separate the OS RAID from the data, which simplifies things. – Criggie Jan 16 '21 at 07:44
  • Agreed about not answering the question, but definitely agree on keeping data and processing on separate drives - looking at replacing my NAS later this year, but currently running everything via an external SSD (that's backed up as not RAID) specifically to solve this! – Rycochet Jan 17 '21 at 09:51
  • Thanks, I already went down that way with a 240 GB SSD in the rear bay for the OS. – Nicola Mori Jan 18 '21 at 08:19
1

I know you'd lose more capacity to parity, but I'd personally go with RAID 60 using 3 x 8-disk arrays, simply for the rebuild time. It won't benefit you in any other way, but a 12 x 16 TB array is a bit much for me personally - yes, it'll work.

The other option, given you want to use ZFS, is to use RAIDZ; I'm no expert, but there are several here who are.
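
For reference, a sketch (placeholder pool and device names, not a tested layout) of the ZFS analogue of the 3 x 8-disk idea - three 8-wide RAIDZ2 vdevs, which resilver faster than 12-wide ones:

```
zpool create tank \
    raidz2 sda sdb sdc sdd sde sdf sdg sdh \
    raidz2 sdi sdj sdk sdl sdm sdn sdo sdp \
    raidz2 sdq sdr sds sdt sdu sdv sdw sdx
```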

Chopper3