We've experienced a failure of the hot-spare during reconstruction of a failed disk in a RAID5 array. It seems we actually lost some data over this; at least, the storage bay is reporting I/O read errors on some blocks.
The question is: Why can't the rebuild just start over with the next available hot-spare drive (there is more than one)?
So let me think this through. Let's assume a 5-disk RAID5 plus hot-spares:
- All disks intact, all data is there and parity is there in case of emergency.
- One disk fails; the data and parity on the remaining 4 disks are used to rebuild the failed 5th disk onto a hot-spare.
- During rebuild,
  - reads of blocks that were originally on the failed disk can be computed from parity
  - if one of these blocks is modified in the meantime, parity on the remaining 4 disks has to change to take this into account (the updated block also has to be rewritten onto the hot-spare if it was already written there, so some kind of changed-blocks bitmap has to exist)
  - whenever data on disks 1-4 changes, the corresponding parity info on the hot-spare has to be rewritten
Now if the hot-spare fails during reconstruction, we still have the data from the 4 disks + the parity info, which would allow a new hot-spare to be used and start over.
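To make that reasoning concrete, here is a minimal sketch of one stripe in Python. It is only a toy model under my own assumptions: parity sits at a fixed position instead of rotating across disks as in real RAID5, every block on the four surviving disks stays readable, and the function names (`xor_blocks`, `degraded_write`, `rebuild_member`) are made up for illustration and not taken from any controller or driver.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def degraded_write(stripe, failed_index, parity_index, new_block):
    """Write to the block that lived on the failed disk.

    The block itself cannot be stored, so only parity absorbs the change.
    """
    others = [b for i, b in enumerate(stripe)
              if i not in (failed_index, parity_index)]
    stripe[parity_index] = xor_blocks(others + [new_block])

def rebuild_member(stripe, failed_index):
    """Recompute the failed member's block from the four survivors."""
    survivors = [b for i, b in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# One stripe: 4 data blocks plus 1 parity block (hypothetical contents).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
PARITY = 4
stripe = data + [xor_blocks(data)]

FAILED = 2
stripe[FAILED] = None                     # disk 3 dies, its block is gone

first_spare = rebuild_member(stripe, FAILED)    # rebuild onto spare 1
degraded_write(stripe, FAILED, PARITY, b"ZZZZ") # block modified meanwhile

# Spare 1 dies: throw its contents away and start over on spare 2.
second_spare = rebuild_member(stripe, FAILED)
```

In this model the first spare's contents are never an input to the rebuild, so `second_spare` ends up holding the current block (`b"ZZZZ"`) purely from the four survivors. That is exactly the property I am assuming above.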
The only thing I can think of right now would be not enough memory for a very large changed-blocks bitmap (in case there were lots of writes during reconstruction).
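For a rough sense of scale, here is a back-of-the-envelope calculation in Python. The numbers are purely hypothetical (a 4 TiB disk tracked at one bit per 64 KiB chunk); I don't know what granularity our storage bay actually uses, and `bitmap_bytes` is just a helper name for this sketch.

```python
def bitmap_bytes(disk_bytes, chunk_bytes):
    """Size in bytes of a bitmap with one bit per chunk of the disk."""
    chunks = -(-disk_bytes // chunk_bytes)  # ceiling division
    return -(-chunks // 8)                  # 8 bits per byte, rounded up

# Assumed values: 4 TiB disk, 64 KiB tracking granularity.
size = bitmap_bytes(4 * 2**40, 64 * 2**10)
print(size / 2**20, "MiB")  # prints "8.0 MiB"
```

Note that the bitmap size depends only on disk size and granularity, not on how many writes happen, since each chunk is either marked or not.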
What am I forgetting? (I've not tried implementing it :-P)