Testing RAID

Question

How does one fully evaluate a RAID configuration?

Pulling drives is one thing, but are there tools and techniques for more?

I've considered putting a nail through a running drive (with a powder-actuated nail gun) to see what would happen, or simulating various electrical anomalies (shorts/opens in cables, power overloads and surges, etc).

What should be tested, and how?

-Adam

A shotgun is a popular option to test storage rack redundancy... albeit a bit expensive. — Oskar Duveborn, May 01 '09 at 17:26
If you decide to put a nail through a running hard drive... could you record it with a video camera? :p — Svish, Jun 03 '09 at 07:47
Oooh, hey, video taping it is a good idea! I doubt there'll be much more to see than a bang and an error on the screen, but people would watch it anyway... — Adam Davis, Jun 03 '09 at 17:38

score 7 · Accepted Answer · answered May 01 '09 at 17:44

For drives where hot-swap isn't an option, many RAID tools (e.g. mdadm on Linux) have a set-faulty command that simulates a drive failing.
For drives where hot-swap is okay, yank a drive!
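
As a rough sketch of the software route, here is one way to script a fail/remove/re-add cycle with mdadm. The array and member names (/dev/md0, /dev/sdb1) are placeholders, not anything from the question, and the helper names are made up for illustration. It uses --fail, which the mdadm man page lists as equivalent to --set-faulty. It defaults to a dry run that only prints the commands, since running them for real degrades the array.

```python
import subprocess


def mdadm_fail_cycle(array, member):
    """Build the mdadm commands that fail, remove, and re-add one member.

    `array` and `member` are device paths supplied by the caller; the
    values used in the demo below are hypothetical placeholders.
    """
    return [
        ["mdadm", "--manage", array, "--fail", member],    # mark member faulty
        ["mdadm", "--manage", array, "--remove", member],  # drop it from the array
        ["mdadm", "--manage", array, "--add", member],     # re-add, triggering a rebuild
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False (needs root)."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical device names; substitute your own before a real run.
    run(mdadm_fail_cycle("/dev/md0", "/dev/sdb1"), dry_run=True)
```

Watching /proc/mdstat while this runs shows whether the array degrades and rebuilds the way you expect.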

I think your testing should cover the reasonable cases that you plan for. If you're trying to set up a server in the bush, then electrical fluctuations are a reasonable thing to test. If you're in a data center, the service agreement probably covers power.

If you think a drive wildly exploding inside a rack is reasonable - then test it. Maybe you're setting up a server in a command center in Baghdad. But once again, less likely if you're in Washington State.

As a general rule, your tests should cover all expected cases:

Drive is old and eventually goes bad (find a drive on its last legs, get it running, then pound it till it fails)
Drive fails a SMART test but seems fine, and you want to replace it just in case
General drive replacement because of size/performance upgrade or you just heard the batch was bad

And reasonable extreme cases.

Server suddenly losing power - okay.
Server itself being hit by lightning - not so much.
Rack falling over - okay.
Rack hit by truck - not so much.
Drive being jostled - okay.
Drive being shot-putted - not so much.

And most importantly - RAID doesn't protect against drives silently corrupting data! So make sure you're doing hashes and file verification!
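
A minimal sketch of what "hashes and file verification" could look like, assuming you simply want a SHA-256 manifest of a directory tree that you can re-check after a failure test or rebuild. The function names are my own and the demo uses a throwaway temp directory; a real setup would store the manifest somewhere other than the array being tested.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or whose hash has changed."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != expected:
            bad.append(rel)
    return bad


if __name__ == "__main__":
    # Demo on a temporary directory standing in for the RAID mount point.
    with tempfile.TemporaryDirectory() as d:
        (Path(d) / "sample.bin").write_bytes(b"test data")
        manifest = build_manifest(d)
        print(verify(d, manifest))
```

Build the manifest before pulling or failing a drive, then run verify once the rebuild finishes; anything it returns was lost or changed along the way.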

score 1 · Answer 2 · answered Jun 01 '09 at 20:01

It is indeed important to test a drive failing inelegantly if you care about the ultimate reliability of the overall solution. Every failed RAID solution I have seen (meaning the redundancy did not protect against failing drives) was due to a failure to test real drive failures. The normal test is to pull a drive, claim that drive failure has been tested, and move on.

The best solution is probably to have a collection of marginal drives, or modified firmware that causes inconsistent responses. Only storage vendors are reasonably likely to have this capability.

I like the idea of putting a nail through a running drive, but the forces on adjacent drives might result in an unrealistically catastrophic failure. Or the complete failure of the drive may result in an unrealistically clean failure.

If I were allowed to do legitimate testing of a RAID, I would destroy a few drives by various means. Hook up wires to random components on the drive's board and fry them or short them. Put a nail through a drive, if the geometry of the enclosure makes this unlikely to destroy adjacent drives (I think the resulting jostling of the remainder of the array is a reasonable test). Intercept a drive's data path and return every possible error, nonsensical results, or correct results delayed by random amounts of time.
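
To make the "intercept the data path" idea concrete, here is a toy, user-space sketch: a wrapper around a file-like object that injects read errors, returns the wrong block, or delays the reply at configurable rates. This is only a simulation for exercising code that sits above the drive; it does not touch a real controller, and the class name and parameters are invented for illustration.

```python
import errno
import io
import random
import time


class FaultyDevice:
    """Wrap a seekable file-like object and misbehave on reads.

    Hypothetical helper: error_rate and wrong_block_rate are probabilities
    per read, max_delay is the upper bound (seconds) of a random delay.
    """

    def __init__(self, backing, block_size=512, error_rate=0.0,
                 wrong_block_rate=0.0, max_delay=0.0, seed=None):
        self.backing = backing
        self.block_size = block_size
        self.error_rate = error_rate
        self.wrong_block_rate = wrong_block_rate
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        # Size of the backing store, in whole blocks.
        self.block_count = backing.seek(0, io.SEEK_END) // block_size

    def read_block(self, index):
        if self.max_delay > 0:
            time.sleep(self.rng.uniform(0, self.max_delay))  # slow reply
        roll = self.rng.random()
        if roll < self.error_rate:
            raise OSError(errno.EIO, "injected read error")
        if roll < self.error_rate + self.wrong_block_rate and self.block_count > 1:
            # Silently return a different block than the one requested.
            index = (index + self.rng.randrange(1, self.block_count)) % self.block_count
        self.backing.seek(index * self.block_size)
        return self.backing.read(self.block_size)


if __name__ == "__main__":
    # Four blocks, each filled with its own index byte.
    data = b"".join(bytes([i]) * 512 for i in range(4))
    dev = FaultyDevice(io.BytesIO(data), wrong_block_rate=0.5, seed=0)
    for i in range(4):
        print(i, dev.read_block(i)[0])
```

Anything layered on top (a checksum check, a software mirror, a rebuild routine) can then be run against this wrapper to see whether it notices the wrong block or just passes it along.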

Expect drives to return the wrong block sometimes. Expect drives to cause any conceivable electrical problem on their connection.

My experience is that no one considering a storage purchase wants to do real testing. This could expose real problems. I'd be very interested to hear if there is anyone who actually tests storage reliability - certainly they are not publishing their results.
