
I know that RDC is how Distributed File System Replication (DFSR) keeps data on all shared devices in an Active Directory domain in sync. What I understand so far: RDC splits data into chunks, then hashes each of those chunks into what's called a signature. The set of signatures is transferred from server to client, the client compares the server's signatures to its own, and the client then requests that the server send only the data for signatures it does not already have.

What I don't understand is this quote from Microsoft:

"RDC divides a file's data into chunks by computing the local maxima of a fingerprinting function that is computed at every byte position in the file. A fingerprinting function is a hash function that can be computed incrementally. For example, if you compute the function F over a range of bytes from the file, Bi...Bj, it should then be possible to compute F(Bi+1...Bj+1) incrementally by adding the byte Bj+1 and subtracting the byte Bi. The range of bytes from the file, Bi...Bj, is called the hash window. The length of this window, in bytes, is called the hash window size.

The RDC library's FilterMax signature generator "slides" the hash window across the entire file by adding the byte at the leading edge and subtracting the byte at the trailing edge of the window. Meanwhile, the generator continually examines the sequence of fingerprint function values over a given range of bytes, called the horizon size. If a fingerprint function value is a local maximum within the range, its byte position is chosen as a "cut point," or chunk boundary.

After the file has been divided into chunks, the signature generator computes a strong hash value (an MD4 hash), called a signature, for each chunk. The signatures can be used to compare the contents of two arbitrarily different versions of a file.

Because the size of the signature file grows linearly with the size of the original file, comparing very large files can be expensive. This cost is reduced dramatically by applying the RDC algorithm recursively to the signature files. For example, if the original file size is 9 GB, the signature file size would typically be about 81 MB. If the RDC algorithm is applied to the signature file, the resulting second-level signature file size would be about 5.7 MB."

Two things in this confuse me: What does any of this "can be computed incrementally" stuff have to do with how RDC works? And how does recursion help reduce bandwidth?

User104163
  • I just want knowledge about how this works, and why it's less bandwidth-heavy than FRS. –  Sep 29 '16 at 21:40

1 Answer


The part about "incrementally" is simply saying that the hash window can "slide" by dropping the byte at the trailing edge of the window and adding the next byte at the leading edge. The window can thus slide incrementally from the beginning of a file to the end in order to detect "shifts" between instances of a file. Say, for instance, you have a text document. The fingerprints are generated from the blocks of data in that document. Then, at a later time, you add a paragraph of text to the beginning of that document. The window can start at the beginning and increment through the file until it matches a block for which it already has a fingerprint.
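One way to see the shift-resistance being described is a toy content-defined chunker. This is an assumed sketch, not RDC's actual code: the rolling sum, the window/mask parameters, and the use of sha256 in place of RDC's MD4 are all illustrative. Because a chunk boundary depends only on the bytes in the window, not on absolute file position, most chunk signatures survive an insertion at the front of the file:

```python
import hashlib
import random

WINDOW, MASK = 8, 0x1F  # toy parameters: cut where fingerprint & MASK == 0

def chunks(data: bytes):
    """Split data at content-defined boundaries using a rolling sum."""
    fp, start = 0, 0
    for i, b in enumerate(data):
        fp += b
        if i >= WINDOW:
            fp -= data[i - WINDOW]          # slide: drop the trailing byte
        if i - start >= WINDOW and fp & MASK == 0:
            yield data[start : i + 1]       # boundary depends on content,
            start = i + 1                   # not on absolute position
    if start < len(data):
        yield data[start:]

def signatures(data: bytes) -> set:
    # sha256 stands in here for the MD4 hash RDC actually uses
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4000))
edited = bytes(rng.randrange(256) for _ in range(100)) + original  # prepend

old_sigs, new_sigs = signatures(original), signatures(edited)
# Most chunk signatures of the edited file already exist on the other side,
# so only the few chunks around the insertion would need to be transmitted.
```

With a fixed-size block scheme, prepending 100 bytes would shift every block and invalidate every signature; here the boundaries resynchronize shortly after the edited region.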

With regard to recursion, say for instance you have one block of data consisting of the bytes "ABCD" and another consisting of the bytes "GHIJ". The blocks might have the fingerprints "01" and "02" respectively, four bytes in total. Instead of transmitting all four bytes, the algorithm takes a fingerprint of "0102" (both fingerprints together), which might produce the fingerprint "03". If the destination file has the same fingerprint-of-fingerprints, then it can be assumed that all of the underlying blocks are unchanged and do not need to be transmitted.
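That fingerprint-of-fingerprints idea can be sketched in a few lines. This is a hypothetical illustration: the signature length, the grouping factor, and truncated sha256 (standing in for RDC's MD4) are all made-up parameters, not RDC's actual values:

```python
import hashlib

SIG_LEN, GROUP = 8, 4  # illustrative sizes, not RDC's real parameters

def signature(data: bytes) -> bytes:
    # truncated sha256 standing in for the MD4 signature
    return hashlib.sha256(data).digest()[:SIG_LEN]

def first_level(chunks):
    """One signature per chunk of the original file."""
    return [signature(c) for c in chunks]

def second_level(sigs):
    """Apply the same signature step to the signature 'file' itself:
    each second-level signature covers GROUP first-level signatures."""
    flat = b"".join(sigs)
    step = SIG_LEN * GROUP
    return [signature(flat[i : i + step]) for i in range(0, len(flat), step)]
```

If a second-level signature matches on both sides, none of the GROUP first-level signatures under it need to be sent, let alone their chunks. That is how, in the quoted example, the 81 MB first-level signature file for a 9 GB file can itself be negotiated with only about 5.7 MB of second-level signatures.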

user6722022
  • This isn't correct. The fingerprint over a small window of bytes is computed incrementally as described - but it isn't used to match anything. It is used to find a cut point between chunks. Then the chunks are hashed (with a cryptographic hash) and _that_ hash is used to identify that chunk and determine if you've already transmitted it or not. The fingerprint is a deterministic function of the bytes; thus the cut points for a file (given a window size) are deterministic - that is, both sides of the transmission, if operating over the same file, will determine the same cut points. – davidbak Nov 05 '20 at 02:57