This is a bit of an older question, but (in my mind) it's always something useful and in some cases necessary. Bit rot will never be completely cured (hush, ZFS community; ZFS only has control over what's on its filesystem while it's there), so we always have to come up with proactive prevention and recovery plans.
While it was designed to facilitate piracy (specifically storing and extracting multi-GB files in chunks on newsgroups where any chunk could go missing or be corrupted), "Parchives" are actually exactly what you're looking for (see the white paper, though don't implement that scheme directly as it has a bug and newer schemes are available), and they work in practice as follows:
- The complete file is fed into the encoder
- Blocks are processed and Reed-Solomon recovery blocks are generated
- `.par` files containing those blocks are output alongside the original file
- When integrity is checked (typically on the other end of a file transfer), the blocks are rechecked, and any blocks needed to reconstruct missing data are pulled from the `.par` files (a small code sketch of this idea follows the list)
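To make that workflow concrete, here is a minimal sketch of the same idea, parity stored in a sidecar file and used to repair the original on read-back, using the Python `reedsolo` library rather than the real `.par` format; the chunk size, parity count, and file layout are my own assumptions for illustration, not anything the Parchive spec requires.

```python
# Minimal sketch: Reed-Solomon parity in a sidecar file, repair at verify time.
# Uses the `reedsolo` library (pip install reedsolo); this is NOT the .par format.
from pathlib import Path
from reedsolo import RSCodec

NSYM = 32           # parity bytes per codeword (corrects up to 16 bad bytes each)
K = 255 - NSYM      # data bytes per codeword (RS over GF(256) caps codewords at 255 bytes)
rsc = RSCodec(NSYM)

def write_parity(data_path: Path, parity_path: Path) -> None:
    """Generate recovery data for data_path; parity_path can live on another drive."""
    data = data_path.read_bytes()
    parity = bytearray()
    for i in range(0, len(data), K):
        chunk = data[i:i + K]
        parity += rsc.encode(chunk)[len(chunk):]   # keep only the appended parity bytes
    parity_path.write_bytes(parity)

def verify_and_repair(data_path: Path, parity_path: Path) -> bytes:
    """Return repaired contents; raises a reedsolo error if the damage is too heavy."""
    data = data_path.read_bytes()
    parity = parity_path.read_bytes()
    repaired = bytearray()
    for n, i in enumerate(range(0, len(data), K)):
        codeword = data[i:i + K] + parity[n * NSYM:(n + 1) * NSYM]
        decoded = rsc.decode(codeword)
        # newer reedsolo versions return (message, message+ecc, errata); older return bytes
        repaired += decoded[0] if isinstance(decoded, tuple) else decoded
    return bytes(repaired)
```

A flipped byte (or several) in the original file is silently corrected by `verify_and_repair`, as long as no single codeword accumulates more than NSYM/2 damaged bytes; real PAR files work on much larger blocks and add their own indexing, but the underlying math is the same.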
Things eventually settled into "PAR2" (essentially a rewrite with additional features) with the following scheme (a scripted version of the round trip follows the list):
- Large file compressed with RAR and split into chunks (typically around 100 MB each, as that was a "usually safe" maximum for Usenet)
- An "index" file is placed alongside the file (for example `bigfile.PAR2`). This has no recovery chunks.
- A series of PAR2 files totaling 10% of the original data size sit alongside it, in increasingly larger file sizes (`bigfile.vol029+25.PAR2`, `bigfile.vol104+88.PAR2`, etc.)
- The person on the other end then gets all the `.rar` files
- An integrity check is run and returns a count (in MB) of how much data needs recovery
- `.PAR2` files are downloaded in an amount equal to or greater than that need
- Recovery is done and integrity verified
- RAR is extracted, and the original file is successfully transferred
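If all you need is the transfer scenario above, you don't have to roll any of this yourself; the par2cmdline tool implements the whole scheme. Here is a hedged sketch of driving it from Python: it assumes `par2` is on the PATH, the file name and the 10% redundancy figure are just examples, and flag spellings can vary between par2 builds.

```python
# Sketch of the PAR2 round trip from the list above, via the par2cmdline tool.
import subprocess

def create_recovery(path: str, redundancy_pct: int = 10) -> None:
    # e.g. "par2 create -r10 bigfile.rar.par2 bigfile.rar" writes the index file
    # plus recovery volumes totaling roughly 10% of the input size
    subprocess.run(
        ["par2", "create", f"-r{redundancy_pct}", f"{path}.par2", path],
        check=True,
    )

def is_intact(path: str) -> bool:
    # "par2 verify" exits non-zero when blocks are missing or damaged
    return subprocess.run(["par2", "verify", f"{path}.par2"]).returncode == 0

def repair(path: str) -> None:
    # "par2 repair" rebuilds damaged blocks, provided enough recovery volumes survived
    subprocess.run(["par2", "repair", f"{path}.par2"], check=True)

if __name__ == "__main__":
    create_recovery("bigfile.rar")       # sender side, after RAR splitting
    if not is_intact("bigfile.rar"):     # receiver side, after download
        repair("bigfile.rar")
```

The same three commands work just as well for at-rest files on a local disk, which is what makes this useful beyond the Usenet case.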
Now, even without a filesystem layer, this system is still fairly trivial to implement using the Parchive tools, but it has two requirements:
- That the files do not change (any change to the file on disk invalidates the parity data; you could work around this, at the cost of extra complexity, with a copy-on-change writing scheme)
- That you run both the parity generation and the integrity check/recovery at appropriate times (one way to decide which is needed is sketched below)
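One way to satisfy both requirements is a little bookkeeping: record a hash (and the modification time) whenever you generate parity, then at check time decide whether a mismatch looks like a deliberate edit (regenerate the parity) or silent corruption (attempt a repair). The sketch below is only a heuristic of my own, not something the Parchive tools mandate; it assumes bit rot normally leaves the mtime alone while real edits bump it.

```python
# Sketch: decide whether a file needs fresh parity or a repair attempt.
import hashlib
import json
from pathlib import Path

def sha256(path: Path) -> str:
    # hash the whole file (stream it in chunks instead for very large files)
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_state(path: Path, state_file: Path) -> None:
    """Call right after (re)generating parity for `path`."""
    state = {"sha256": sha256(path), "mtime": path.stat().st_mtime}
    state_file.write_text(json.dumps(state))

def check(path: Path, state_file: Path) -> str:
    """Return 'ok', 'regenerate' (file was edited), or 'repair' (likely corruption)."""
    state = json.loads(state_file.read_text())
    if sha256(path) == state["sha256"]:
        return "ok"
    if path.stat().st_mtime != state["mtime"]:
        return "regenerate"   # contents and timestamp both changed: assume a deliberate edit
    return "repair"           # contents changed but timestamp did not: assume bit rot
```

On "repair" you hand the file to the PAR2 tools; on "regenerate" you rebuild the parity and call `record_state` again.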
Since all the math and methods are both known and battle-tested, you can also roll your own to meet whatever needs you have (as a hook into file reads/writes, spanning arbitrary path depths, storing recovery data on a separate drive, etc.). For initial tips, refer to the pros: https://www.backblaze.com/blog/reed-solomon/
Edit: The same research that led me to this question also led me to a whole body of existing work that I was previously unaware of.