I saw it in a presentation a few weeks back, tried to implement it, failed and forgot about it. But now I wanna know how it works =)
It's a way of efficiently transferring/storing data. It would work in any language. This is what (I think) it does:
You have 1 very big file (e.g. the entire JavaScript collection of a website).
- Split it in blocks of 48 bytes
- Hash every block of 48 bytes (eg. MD5)
- Split the list of blocks on hashes that end with 0x00
- The big blocks (>= 1 hash) should now be different sizes. Some very big, some very small.
- Glue the blocks between those hashes (again: very different sizes of actual data)
- Hash those blocks
- Now you have a list of hashes that represent the current version of the big file
The idea is that when a piece of code changes in the big file, only 1 or 2 hashes change. With the new file, you do all the steps above and you only upload/download the parts (blocks, identifiable by their hashes) that have actually changed. Depending on how much code was changed and on the size of the blocks surrounding that code, you'll rarely need to download more than about 4 blocks (instead of the whole file). The other end of the communication would then replace the original blocks (same algorithm, same functionality) with the new blocks.
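To make the transfer step concrete, here is a minimal Python sketch of the hash-list idea. It assumes the file has already been split into chunks somehow (the chunks below are hand-picked), and `digest`, `make_manifest`, `chunks_to_send` and `rebuild` are names made up for illustration, not anything from the presentation.

```python
import hashlib

def digest(chunk):
    # MD5 is used only as a chunk identifier here, as in the steps above.
    return hashlib.md5(chunk).hexdigest()

def make_manifest(chunks):
    # Ordered list of chunk hashes that describes one version of the file.
    return [digest(c) for c in chunks]

def chunks_to_send(manifest, chunks, receiver_store):
    # Only the chunks whose hash the receiver does not already hold.
    return {h: c for h, c in zip(manifest, chunks) if h not in receiver_store}

def rebuild(manifest, store):
    # The receiver glues the file back together from its chunk store.
    return b"".join(store[h] for h in manifest)

# Hypothetical example data: three chunks, the middle one edited.
old_chunks = [b"function a(){}", b"function b(){}", b"function c(){}"]
new_chunks = [b"function a(){}", b"function b(){return 1}", b"function c(){}"]

store = {digest(c): c for c in old_chunks}      # what the receiver has cached
manifest = make_manifest(new_chunks)            # small, always sent
delta = chunks_to_send(manifest, new_chunks, store)
store.update(delta)                             # receiver adds the new chunk
restored = rebuild(manifest, store)
```

In this toy case only the edited chunk ends up in `delta`; the other two are found in the receiver's store by hash.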
Sound familiar? They mentioned a name, but I couldn't find anything on it. When I tried to build it, it just didn't work, because unless the change is exactly 48 bytes long [1], ALL the hashes after that change [2] are different...
If someone knows the name: great. If someone could also explain it: perfect!
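The failure is easy to reproduce. This is only a sketch of the fixed-offset variant as I built it, with a made-up helper `fixed_block_hashes` and made-up test data; it is not how the presented technique is meant to work.

```python
import hashlib

BLOCK = 48  # fixed block size, cut at offsets 0, 48, 96, ...

def fixed_block_hashes(data):
    # Hash every 48-byte block, cut at fixed offsets from the start of the file.
    return [hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

old = bytes(range(256)) * 8            # 2048 bytes of sample data
new = old[:100] + b"!" + old[100:]     # insert a single byte at offset 100

old_h = fixed_block_hashes(old)
new_h = fixed_block_hashes(new)

# Count blocks whose hash survived the one-byte insertion.
unchanged = sum(1 for a, b in zip(old_h, new_h) if a == b)
print(f"{unchanged} of {len(old_h)} block hashes unchanged")
```

Only the two blocks in front of the edit keep their hashes, because every later block is shifted by one byte. As far as I can tell, LBFS-style chunkers avoid this by choosing boundaries from the content itself (a hash over a sliding 48-byte window) rather than from fixed offsets.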
UPDATE
I found the presentation it was in. It was mentioned (and is used) in a new product "Silo": http://research.microsoft.com/apps/pubs/default.aspx?id=131524 Related: http://channel9.msdn.com/Events/MIX/MIX11/RES04 (So it actually was Microsoft research! Neat!)
From the first link:
A Silo-enabled page uses this local storage as an LBFS-style chunkstore.
In the second link (a video), the good stuff starts at 6:30. Now I've seen it twice... I still don't get it =)
Keywords are Delta encoding and Rabin fingerprints.
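For the Rabin fingerprint part, here is a rough sketch of content-defined chunking as I understand it from the LBFS description: slide a 48-byte window over every byte offset, hash the window, and cut a chunk wherever the low bits of the hash are zero. The code swaps in a simple polynomial rolling hash instead of a real Rabin fingerprint, and `WINDOW`, `MASK`, `BASE`, `MOD` and `chunk` are illustrative choices of mine (LBFS itself tests 13 bits and enforces minimum and maximum chunk sizes, which this sketch leaves out).

```python
import random

WINDOW = 48            # sliding window size in bytes
MASK = 0xFF            # cut when the low 8 bits are zero (assumed; ~256-byte chunks)
BASE = 257             # polynomial base (assumed)
MOD = (1 << 61) - 1    # large prime modulus (assumed)

def chunk(data):
    # Split data where the hash of the last WINDOW bytes has its low bits zero.
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + b) % MOD                     # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                  # trailing partial chunk
    return chunks

# Made-up test data: random bytes with a 3-byte insertion in the middle.
rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(20000))
new = old[:10000] + b"xyz" + old[10000:]

old_set = set(chunk(old))
changed = [c for c in chunk(new) if c not in old_set]
print(len(chunk(new)), "chunks,", len(changed), "changed")
```

Because each boundary depends only on the 48 bytes in front of it, boundaries after the edit reappear at the same content (just shifted), so only the chunks around the insertion get new hashes.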