
Since SSDs have a finite write endurance that requires wear leveling, one would assume that, all else being equal, two identical SSDs mirroring the same writes would wear nearly identically. When one drive fails, you can assume the other is holding on only by minute differences in materials and logic. Since wear leveling decouples the physical location of bits from the logical structure, I would assume a solution would be to pair similarly sized, but not identical, drives. For example, I would mirror a 240 GB drive with a 256 GB drive. Even though I am logically not using that extra 16 GB of physical space, wear leveling would not allow that area to sit idle. Or am I conflating the mechanics of wear leveling?
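To make the overprovisioning idea concrete, here is a toy back-of-the-envelope model (the write totals are hypothetical, and real wear leveling is far more complex than an even spread): under ideal leveling, the same lifetime write volume is spread over more physical cells on the larger drive, so each cell wears less.

```python
# Toy model: with ideal wear leveling, total writes are spread across
# ALL physical cells, so unused physical capacity lowers per-cell wear.
# The write total below is illustrative, not from any real drive.

def writes_per_cell(total_gb_written, physical_gb):
    """Average program/erase load per GB of flash under ideal leveling."""
    return total_gb_written / physical_gb

total_written = 100_000  # lifetime GB written to each mirror (hypothetical)

# A mirrored pair receives identical writes, but spreads them differently:
wear_240 = writes_per_cell(total_written, 240)
wear_256 = writes_per_cell(total_written, 256)

print(f"240 GB drive: {wear_240:.1f} write cycles per GB")  # ~416.7
print(f"256 GB drive: {wear_256:.1f} write cycles per GB")  # ~390.6
```

On this simplified model the larger drive sees roughly 6% less wear per cell, which staggers the drives' lifetimes somewhat but does not change them by orders of magnitude.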

Or should RAID 1 just be avoided entirely in favor of RAID 5 or RAID 6 with a hot spare? With no identical data being written, the wear patterns should differ across drives, so the failure of a single drive would not signal the imminent failure of the others. Even if it did, the fault tolerance of RAID 6, which survives the loss of two drives, should ameliorate that concern. There would be a large processing hit recalculating parity when swapping in a new drive, but the I/O speed of SSDs should cut the total rebuild time well below that of spinning media. Also, according to a RAID calculator, the total space available from four 256 GB drives would be the same whether I went with RAID 10 or RAID 6. If I had the funds I would buy eight drives and two RAID cards, test both, and see how failures occur, but I totally don't have the funds for that research.
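The capacity equivalence mentioned above follows from the standard usable-space formulas for each RAID level; a quick sketch (simplified, ignoring controller metadata overhead):

```python
def usable_capacity(n_drives, drive_gb, level):
    """Usable space for common RAID levels (simplified; ignores metadata)."""
    if level == "raid1":
        return drive_gb                    # all drives hold the same data
    if level == "raid10":
        return (n_drives // 2) * drive_gb  # striped mirrored pairs
    if level == "raid5":
        return (n_drives - 1) * drive_gb   # one drive's worth of parity
    if level == "raid6":
        return (n_drives - 2) * drive_gb   # two drives' worth of parity
    raise ValueError(f"unknown level: {level}")

# Four 256 GB drives, as in the question:
print(usable_capacity(4, 256, "raid10"))  # 512
print(usable_capacity(4, 256, "raid6"))   # 512
```

With exactly four drives the two layouts tie at 512 GB usable; RAID 6 only pulls ahead in capacity at five or more drives.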

Which way should I go: RAID 10 or RAID 6? Is there much documentation on simultaneous failure of mirrored identical SSDs? If so, does writing different data to each device under RAID 6 protect against it, or does the quantity of data, not its shape, dictate drive wear? Does a mismatched size offer some protection, since wear leveling will use all the available physical hardware rather than only what the logical structure dictates? And given the fast I/O of SSDs, does RAID 6 become more attractive when rebuilding data onto replacement drives?

Tvanover

1 Answer


In practice, the MTBF and write-cycle tolerances of SSDs are estimates, not a kill switch. If an SSD is rated for a billion writes, it doesn't die at a billion and one. Combine wear-leveling algorithms with TRIM and other on-chip garbage collection, and two SSDs dying minutes or even days apart purely from write wear would be rare.

You should be monitoring your hardware for impending failures anyway, so even if both disks were wearing out at the same time, you'd be able to replace both before a catastrophic failure from something like write wear.
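That monitoring typically means watching the drive's SMART wear attribute. A minimal sketch of the idea, assuming `smartctl -A` output has already been captured as text; note that the attribute name varies by vendor (e.g. `Media_Wearout_Indicator` on some Intel drives, `Wear_Leveling_Count` on some Samsung drives), and the sample line below is illustrative, not from a real device:

```python
# Sketch: extract the normalized SMART wear value from smartctl -A text.
# 100 means like-new; lower means more worn. Sample output is made up.

SAMPLE_SMARTCTL_OUTPUT = """\
233 Media_Wearout_Indicator 0x0032 097 097 000 Old_age Always - 0
"""

def wear_remaining(smartctl_text, attr="Media_Wearout_Indicator"):
    """Return the normalized SMART value for `attr`, or None if absent."""
    for line in smartctl_text.splitlines():
        if attr in line:
            fields = line.split()
            return int(fields[3])  # VALUE column: normalized current value
    return None

value = wear_remaining(SAMPLE_SMARTCTL_OUTPUT)
print(value)  # 97
if value is not None and value < 10:
    print("Replace this drive soon")
```

In practice you would run this check on a schedule and alert well before the wear value approaches the vendor's threshold, replacing drives proactively rather than waiting for failure.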

MDMarra
  • I understand, but when dealing with identical drives with the same firmware, identical wear-leveling algorithms, writing identical data, etc., where the only difference is manufacturing variation within very tight tolerances, RAID 1 seems like a recipe for simultaneous failure. – Tvanover Jul 24 '13 at 23:38
  • My entire answer explains why that's not the case. Did you read it? – MDMarra Jul 24 '13 at 23:39
  • Sure, but you're taking backups and you have a high-availability setup in your environment so at worst you have to source new hardware and restore from backups while your other HA pair server handles requests. – Joel E Salas Jul 24 '13 at 23:39
  • Of course we will be taking further precautions, SMART monitoring, regular backups, etc. This is not the entirety of our data protection, just a concern about an aspect of it. – Tvanover Jul 24 '13 at 23:39
  • @Tvanover The likeliness of hitting a double-drive failure with an SSD is comparable to mechanical drives. While exceedingly rare you still have plenty of warning before the drives go. SSDs have that advantage because they won't fail due to motor/spindle/head failures, so if you keep careful enough watch you're fine. – Nathan C Jul 25 '13 at 00:02
  • There is a fair amount of variability within individual NAND chips as well. Even if you write identical data to the drives, it's very unlikely they'll get to the point where the NAND is actually failing at the same time. What you should be doing is tracking the SSD's wearout figure through SMART (which is conservative, but also tied to the warranty of the drive), and replace your drives once you hit that threshold. This may mean you have to replace all your drives at once to stay within warranty, but it won't mean the drives actually all fail at once. – Daniel Lawson Jul 27 '13 at 12:25