
A piece of software I'm working on outputs quite a lot of files, which are then stored on a server. During its runtime I've had one file go corrupt on me. These files are critical to the operation, so this cannot happen. I'm therefore trying to come up with a way of adding error correction to the files to prevent this from ever happening again.

I've read up on Reed-Solomon, which takes k blocks of data, adds m blocks of parity, and can then reconstruct up to m missing blocks. So what I'm thinking is to take the data stream, split it into these blocks, and store them in sequence on disk: first the data blocks, then the parity blocks, repeating until the entire file is stored. k, m, and the block size are of course variables I'll have to investigate and play with.

However, it's my understanding that Reed-Solomon requires you to know which blocks are corrupt. How could I possibly know that? My thinking is that I'd have to add some extra, simpler error-detection code, like CRC32, to each block as I write it; otherwise I can't tell whether a block is corrupted.
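To make that concrete, here is a rough sketch of the block framing I have in mind, using only Python's standard library (the Reed-Solomon layer itself is left out; any block whose CRC fails would be handed to the decoder as an erasure):

    import struct
    import zlib

    BLOCK = 4096  # payload bytes per block; one of the sizes to tune

    def write_block(out, payload: bytes) -> None:
        # Prefix every block (data and parity alike) with its CRC32.
        out.write(struct.pack("<I", zlib.crc32(payload)))
        out.write(payload)

    def read_block(inp):
        # Returns (payload, ok); ok=False marks this block's position
        # as an erasure for the Reed-Solomon decoder.
        header = inp.read(4)
        if len(header) < 4:
            return None, False  # end of file
        payload = inp.read(BLOCK)
        (stored,) = struct.unpack("<I", header)
        return payload, zlib.crc32(payload) == stored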

Have I understood this correctly, or is there a better way to accomplish this?

KennethJ
  • Are you expecting to have data blocks go missing completely, so that the position of the data after that point shifts forward, or are you only expecting to have data blocks get corrupted without changing size? If it's the latter then you don't need to know which blocks are corrupted. – 101 Nov 03 '16 at 07:29
  • @101 Ideally I'd like to be able to guard against both scenarios, though I'm not sure how I would accomplish the first. Why would I not need to know which blocks are corrupted? If I have three data blocks and two parity blocks, and one of them contains corrupted data, I'd have to know which one it was before I could reliably reconstruct the data, no? – KennethJ Nov 05 '16 at 00:55
  • No, you just feed all of the data blocks back to the RS decoder and it will either decode the original message or let you know that there are too many corrupt blocks. You shouldn't need to know more than that. – 101 Nov 06 '16 at 19:13
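To illustrate the trade-off behind that last comment: with m parity symbols, a Reed-Solomon decoder can fix up to m/2 corrupted symbols at unknown positions on its own, but up to m if you tell it which positions are suspect (erasures). A minimal demonstration, assuming the third-party reedsolo package (pip install reedsolo):

    from reedsolo import RSCodec

    rsc = RSCodec(10)                # 10 parity symbols per codeword
    encoded = rsc.encode(b"critical file contents")

    encoded[3] ^= 0xFF               # corrupt two positions
    encoded[7] ^= 0xFF

    # Unknown positions: succeeds while errors <= 10 // 2 = 5.
    message = rsc.decode(encoded)[0]  # reedsolo >= 1.0 returns a tuple
    assert message == b"critical file contents"

    # Known positions (erasures): tolerates up to 10 bad symbols.
    message = rsc.decode(encoded, erase_pos=[3, 7])[0]
    assert message == b"critical file contents"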

1 Answer


This is a bit of an older question, but (in my mind) it is always useful and in some cases necessary. Bit rot will never be completely cured (hush, ZFS community; ZFS only has control over what's on its filesystem while it's there), so we always have to come up with proactive prevention and recovery plans.

While it was designed to facilitate piracy (specifically, storing and extracting multi-GB files in chunks on newsgroups, where any chunk could go missing or be corrupted), "Parchives" are actually exactly what you're looking for (see the white paper, though don't implement that scheme directly: it has a bug, and newer schemes are available). In practice they work as follows:

  1. The complete file is input into the encoder
  2. Blocks are processed and Reed-Solomon blocks are generated
  3. .par files containing those blocks are output alongside the original file
  4. When integrity is checked (typically on the other end of a file transfer), the blocks are rechecked, and any blocks needed to reconstruct missing data are pulled from the .par files.

Things eventually settled into "PAR2" (essentially a rewrite with additional features), with the following scheme (a command-level sketch follows the list):

  • Large file compressed with RAR and split into chunks (typically around 100MB each, as that was a "usually safe" maximum on Usenet)
  • An "index" file is placed alongside the file (for example bigfile.PAR2). This has no recovery chunks.
  • A series of .par2 files totaling 10% of the original data size sits alongside, in increasingly larger file sizes (bigfile.vol029+25.PAR2, bigfile.vol104+88.PAR2, etc.)
  • The person on the other end then gets all the .rar files
  • An integrity check is run, and returns a count in MB of how much data needs recovery
  • .PAR2 files are downloaded in an amount equal to or greater than the need
  • Recovery is done and integrity verified
  • RAR is extracted, and the original file is successfully transferred
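For example, with the par2cmdline tool (an assumption on my part; any PAR2 implementation exposes the same create/verify/repair operations, and the file names here are placeholders), that whole cycle can be driven from a short script:

    import subprocess

    # Create recovery volumes with 10% redundancy alongside the original.
    subprocess.run(["par2", "create", "-r10", "bigfile.par2", "bigfile.dat"],
                   check=True)

    # On the receiving end (or on a schedule), check integrity.
    result = subprocess.run(["par2", "verify", "bigfile.par2"])

    # A non-zero exit code means damage was found; attempt a repair.
    if result.returncode != 0:
        subprocess.run(["par2", "repair", "bigfile.par2"], check=True)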

Now, without a filesystem layer, this system is still fairly trivial to implement using the Parchive tools, but it has two requirements (a sketch of automating both follows the list):

  1. That the files do not change, since any change to a file on disk invalidates its parity data (you could allow changes, at the cost of extra complexity, with a copy-on-change writing scheme)
  2. That you run both the parity generation and the integrity check/recovery at appropriate times.
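A minimal sketch of one way to honor both, again assuming par2cmdline (the helper name, redundancy level, and mtime heuristic are all illustrative):

    import glob
    import os
    import subprocess

    def maintain(path: str) -> None:
        par = path + ".par2"
        if not os.path.exists(par) or os.path.getmtime(path) > os.path.getmtime(par):
            # Requirement 1: the file changed, so existing parity is invalid;
            # discard stale recovery volumes and regenerate.
            for stale in glob.glob(path + "*.par2"):
                os.remove(stale)
            subprocess.run(["par2", "create", "-r10", par, path], check=True)
        elif subprocess.run(["par2", "verify", par]).returncode != 0:
            # Requirement 2: damage found on a routine check (e.g. from cron);
            # repair it.
            subprocess.run(["par2", "repair", par], check=True)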

Since all the math and methods are both known and battle-tested, you can also roll your own to meet whatever needs you have (as a hook into file read/write, spanning arbitrary path depths, storing recovery data on a separate drive, etc.). For initial tips, refer to the pros: https://www.backblaze.com/blog/reed-solomon/
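As a roll-your-own starting point in that spirit, here is a minimal sketch (assuming the reedsolo package; the paths, the .rs suffix, and the 32-symbol redundancy are all illustrative) that walks a tree and writes a Reed-Solomon-protected sidecar for every file onto a separate drive:

    import os

    from reedsolo import RSCodec

    rsc = RSCodec(32)          # 32 parity bytes per 223-byte chunk (~14% overhead)
    DATA_ROOT = "/srv/data"    # hypothetical source tree
    PARITY_ROOT = "/mnt/other-drive/parity"

    def protect_tree() -> None:
        for dirpath, _dirs, files in os.walk(DATA_ROOT):
            for name in files:
                src = os.path.join(dirpath, name)
                dst = os.path.join(PARITY_ROOT,
                                   os.path.relpath(src, DATA_ROOT)) + ".rs"
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                with open(src, "rb") as f, open(dst, "wb") as out:
                    # For simplicity the sidecar holds the whole encoded
                    # stream (data + parity); rsc.decode() on its contents
                    # recovers the original even with scattered corruption.
                    out.write(rsc.encode(f.read()))

A real implementation would stream large files instead of reading them whole, and would store only the parity symbols rather than a full protected copy.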

Edit: the same research that led me to this question also led me to a whole body of existing work that I was previously unaware of.

joshfindit