
So quick question - our RAID6 array is currently rebuilding and there is a VERY noticeable filesystem performance hit (home directories are NFS mounted on the array).

I'd sort of expect that, given that rebuilding the array puts a massive read/write burden on the controller, but it occurred to me that I don't really have anything to compare this to.

Is this kind of serious slowdown (5-10 second freezes, pretty frequently) expected behavior during a RAID rebuild coupled with heavy read/write usage? (Performance takes a noticeable hit during backups and when users are downloading large [multi-GB] files via FTP.)

Any thoughts on this would be appreciated. This is hardware RAID6 (an LSI MegaRAID 9266-8i controller) on a 40TB array exported over NFS on the local network (i.e. the server is physically very close to the workstations).

Alex
  • 2.8 TB SATA disks – Alex Apr 23 '14 at 13:18
  • what's a backup? – Alex Apr 23 '14 at 13:29
  • [And, for the record, we have nightly backups to a second array as well as long-term/critical backups to a third array] – Alex Apr 23 '14 at 13:37
  • As advice: check the firmware settings. Often (e.g. on Adaptec controllers) you can adjust the rebuild priority. If it is set high, your normal IO (from usage) gets throttled quite hard; on the lower setting the rebuild takes longer but the impact on users is smaller. (See the sketch after this comment list for what that knob looks like on an LSI controller.) – TomTom Apr 23 '14 at 13:42
  • Wait, I just did the math on this. A 40TB array using 3TB disks in RAID6 would mean that you have a 16-disk array (14 data disks × 3TB ≈ 42TB raw). 16 high-capacity, nearline SAS disks in RAID6 is not a safe configuration I'd feel comfortable using, FYI. You have both data-loss risks and performance issues (as you're seeing) in this configuration. – HopelessN00b Apr 23 '14 at 13:49
  • So I agree that there is a risk here - however, we felt that having 2 (essentially mirrored) systems provided enough failsafe redundancy. i.e. should this array have a 2 disk failure we can replace the disks, rebuild from scratch and then repopulate from the backup. However, what would you recommend for a 40 TB (or 60/80 TB) server - RAID1 is appealing but I feel like you'd physically struggle to house enough disks? – Alex Apr 23 '14 at 15:35
  • My experience is certainly not the end-all when it comes to this, but typically for our SANs we would set up two arrays (50% of the storage to each one) and make one of them RAID 10 (the best, IMHO) and the other RAID 50. We'd use the RAID 10 LUNs for heavy-write workloads like DB servers, Exchange servers with database (vs. file) backends, etc. We'd use the RAID 50 for the OS of the server, so essentially we'd have a server with 2 VMDKs/VHDs, one from each array. This worked great, and you may want to think about some sort of mixing/matching for your next storage unit. – Brad Bouchard Apr 23 '14 at 17:33
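
To follow up on TomTom's comment about rebuild priority: on an LSI MegaRAID controller like the one in the question, that knob is the "rebuild rate" (the percentage of controller resources dedicated to the rebuild), and it can be read with LSI's MegaCLI command-line tool. Below is a minimal, read-only Python sketch; the MegaCli64 install path is an assumption, the output wording varies between MegaCLI versions, and the exact syntax for actually changing the rate (the AdpSetProp RebuildRate property) should be confirmed with MegaCli -h before use.

```python
import subprocess

# Assumed install path for LSI's MegaCLI tool; adjust for your system.
MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"

def current_rebuild_rate():
    """Query the controller's rebuild rate (the percentage of controller
    resources dedicated to rebuilding) via MegaCLI's AdpGetProp property."""
    out = subprocess.run(
        [MEGACLI, "-AdpGetProp", "RebuildRate", "-aALL"],
        capture_output=True, text=True, check=True,
    )
    # Typical output contains a line like "Rebuild Rate = 30%", though the
    # exact wording differs between MegaCLI versions.
    return out.stdout.strip()

print(current_rebuild_rate())
# Lowering the rate (via the corresponding AdpSetProp RebuildRate property)
# reduces the impact on normal I/O at the cost of a longer rebuild, exactly
# the trade-off TomTom describes.
```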

1 Answer


First, here is a great resource that outlines rebuild times.

RAID rebuilds and how they work pre and post failure.

Now, as far as my thoughts on the rebuild: we know that rebuilds make for some really sluggish performance, and rightfully so. As you will see from the link above, a rebuild is not just writing to the replacement disk; the controller has to read the corresponding data and parity from every surviving disk in the array to reconstruct what was on the failed disk (in the event of a post-failure rebuild), all while the server keeps serving its normal workload. Another thing to keep in mind is that operations that would normally take no time and relatively few resources now take more resources than usual and tax an already-taxed server. In the event of a pre-failure rebuild (a little better on performance, but not much), you can get lucky and have a drive fail and the RAID rebuild before end users even know anything had a problem (hopefully, as an SA, you have some sort of alerting system so you aren't surprised by it either).
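
To make concrete why a rebuild leans on the whole array, here is a minimal, purely illustrative Python sketch of single-parity (RAID5-style) reconstruction; RAID6 layers a second, Reed-Solomon-based parity on top of this, but the key point is the same: rebuilding one lost block means reading the corresponding block from every surviving disk in the stripe, on top of all the normal user I/O. The 4-disk stripe and tiny 4-byte blocks below are made-up illustration values.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# A toy stripe across a 4-disk array: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)            # parity written when the stripe was created

# Simulate losing the disk that held the second block.
surviving = [data[0], data[2], parity]

# Rebuilding the lost block requires reading *every* surviving member of the
# stripe and XOR-ing them back together -- there is no shortcut.
rebuilt = xor_blocks(surviving)
assert rebuilt == b"BBBB"
print("reconstructed:", rebuilt)
```

Scaled up to a 16-spindle array with real stripe sizes, that read-everything-then-write pattern is exactly what is competing with your users' NFS traffic for the duration of the rebuild.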

The 5-10 second freezes you see are definitely normal, and they will be even more noticeable if the server you are rebuilding is any kind of database server with higher-than-usual reads and writes by default (e.g. a SQL server that houses a database end users hit all day long; a property management company I used to consult for had a program that read and wrote tenant records all day long, and it always saw heavy usage).

Another thing I recommend is to install whatever RAID utility (the GUI version) comes with your controller on the operating system, so you can monitor the rebuild without having to boot into the controller BIOS.
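
For this particular controller, the GUI utility would typically be LSI's MegaRAID Storage Manager, and the same rebuild progress is exposed by the MegaCLI command-line tool. As a hedged sketch (the MegaCli64 path and the enclosure:slot value below are assumptions; substitute whatever MegaCli64 -PDList -aALL reports for your own drives, and expect the output wording to vary by MegaCLI version), a small Python loop can poll the progress so you never have to sit in the controller BIOS:

```python
import subprocess
import time

# Assumptions: MegaCli64 lives at this path and the rebuilding drive sits at
# enclosure 32, slot 4 -- substitute the values reported by
# "MegaCli64 -PDList -aALL" for your own array.
MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"
PHYS_DRV = "[32:4]"

def rebuild_progress():
    """Return MegaCLI's rebuild-progress report for one physical drive."""
    out = subprocess.run(
        [MEGACLI, "-PDRbld", "-ShowProg", "-PhysDrv", PHYS_DRV, "-aALL"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    # Poll every 10 minutes; the report includes a completion percentage and
    # elapsed time (exact wording varies between MegaCLI versions).
    while True:
        print(time.strftime("%H:%M"), rebuild_progress())
        time.sleep(600)
```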

A very small and almost non-existent issue these days is NFS vs. iSCSI. I know you're using NFS, and it used to be that iSCSI had better overall performance in virtualization scenarios, but with recent improvements to hypervisors, hard drives, and controllers, NFS is almost identical in performance to iSCSI, so it sounds like you have a very nice storage setup.

I'd be happy to answer anything else you need to know, so please feel free to comment.

Brad Bouchard