
We experienced a failure of the hot spare during reconstruction of a failed disk in a RAID 5 array. It seems we actually lost some data as a result; at least, the storage bay is returning I/O read errors on some blocks.

The question is: Why can't the rebuild just start over with the next available hot-spare drive (there is more than one)?


So let me think: Let's assume a 5-disk RAID5 + hot-spares:

  1. All disks intact, all data is there and parity is there in case of emergency.
  2. One disk fails; the data and parity on the remaining 4 disks are used to rebuild the failed 5th disk onto a hot spare.
  3. During the rebuild:
    • reads of blocks that were on the failed disk can be computed from the surviving data and parity
    • if one of these blocks is modified in the meantime, the parity on the remaining 4 disks has to change to take this into account (if the block was already written to the hot spare, it has to be rewritten there too, so some kind of changed-blocks bitmap has to exist)
    • whenever data on disks 1-4 changes, any parity already rebuilt on the hot spare has to be rewritten as well
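
To make the reasoning concrete, here is a minimal sketch of single-parity reconstruction for one stripe. It assumes plain XOR parity and a toy 4-byte block size; the helper `xor_blocks` and the sample data are made up for illustration and say nothing about how the actual storage bay implements it.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on an assumed 5-disk RAID 5: four data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)
stripe = data + [parity]

# Disk 2 fails: its block is the XOR of everything that survived.
failed = 2
survivors = [blk for i, blk in enumerate(stripe) if i != failed]
rebuilt = xor_blocks(survivors)

# A write to disk 0 during the rebuild: parity is updated from old data,
# new data and old parity, without touching the failed disk.
new_block = b"ZZZZ"
new_parity = xor_blocks([parity, stripe[0], new_block])
```

As long as exactly one block per stripe is missing, `rebuilt` can be recomputed at any time, which is why restarting on another hot spare looks feasible in principle.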

Now, if the hot spare fails during reconstruction, we still have the data and parity on the 4 remaining disks, which should allow the rebuild to start over on a new hot spare.

The only obstacle I can think of right now would be insufficient memory for a very large changed-blocks bitmap (in case there were lots of writes during reconstruction).

What am I forgetting? (I've not tried implementing it :-P)

Marki

1 Answer


Uh, never mind. The initial assumption was wrong: the hot spare didn't fail; another disk from the RAID group failed during reconstruction.

The array kept that disk alive as long as it could, but some sectors were inevitably lost (a double failure on a single-parity RAID).
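
A small sketch of why those sectors cannot be recovered, again assuming plain XOR parity and toy 4-byte blocks (the helper `xor_blocks` and the data are hypothetical): with two blocks of a stripe missing, the survivors only yield the XOR of the two lost blocks, not either block on its own.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on an assumed 5-disk RAID 5: four data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = data + [xor_blocks(data)]

# Two members are unreadable for this stripe (original failure + second failure).
lost = {1, 3}
survivors = [blk for i, blk in enumerate(stripe) if i not in lost]

# All the survivors can tell us is lost_1 XOR lost_3.
residue = xor_blocks(survivors)

# A different pair of blocks would produce exactly the same residue,
# so the original contents are ambiguous.
other_pair = [b"\x00" * 4, residue]
```

Since `xor_blocks(other_pair)` matches `residue` just as well as the real blocks do, a single parity block per stripe is not enough information to rebuild either one.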

Marki