
I've been researching Rabin fingerprinting for the past couple of days. While the general idea is simple enough, I'm having significant trouble understanding the implementations circulating around the net. In particular, all of them seem to be derived from the original LBFS paper. For example, librabinpoly defines the sliding window as:

    static u_int64_t slide8(RabinPoly *rp, unsigned char m) {
        rp->circbuf_pos++;
        if (rp->circbuf_pos >= rp->window_size) {
            rp->circbuf_pos = 0;
        }
        unsigned char om = rp->circbuf[rp->circbuf_pos];
        rp->circbuf[rp->circbuf_pos] = m;
        return rp->fingerprint = append8 (rp, rp->fingerprint ^ rp->U[om], m);
    }

    static u_int64_t append8(RabinPoly *rp, u_int64_t p, unsigned char m) {
        return ((p << 8) | m) ^ rp->T[p >> rp->shift];
    }

where the U and T tables are generated from the initial polynomial. None of the papers on Rabin fingerprinting that I have seen discuss the use of those two tables or the XOR operations. My gut feeling is that this has something to do with modular arithmetic, but I'm not entirely sure. Git's source code also uses Rabin fingerprinting, but instead of deriving the tables dynamically it has a set of precomputed ones. So my question is: what exactly do those XOR operations achieve, and why does the code look so different from the 'canonical' explanation of the algorithm?

lrd dsk

1 Answer


The "canonical explanation" uses a rolling hash that is not a Rabin fingerprint. It's pretty similar, though. Without getting too deep into the weeds of abstract algebra, the idea behind both is to evaluate a polynomial derived from the message in a particular ring, which has 0, 1, addition, subtraction, and multiplication, but not necessarily division (integers mod m for the canonical explanation; GF(2^k) for Rabin fingerprints, which is to say, polynomials with coefficients mod 2, modulo an irreducible polynomial of degree k).
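For contrast, here is a minimal sketch of the integers-mod-m flavor of rolling hash. The names int_hash and int_slide and the constants BASE, MOD and WIN are made up for illustration and do not come from any of the libraries mentioned in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy parameters, chosen arbitrarily for this sketch. */
enum { BASE = 256, MOD = 1000003, WIN = 4 };

/* Evaluate buf[0]*BASE^(n-1) + ... + buf[n-1] in the integers mod MOD. */
static uint32_t int_hash(const unsigned char *buf, size_t n) {
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = (h * BASE + buf[i]) % MOD;
    return (uint32_t)h;
}

/* Roll a WIN-byte window: drop `out` (the oldest byte), append `in`. */
static uint32_t int_slide(uint32_t h, unsigned char out, unsigned char in) {
    uint64_t top = 1;                        /* BASE^(WIN-1) mod MOD */
    for (int i = 1; i < WIN; i++)
        top = top * BASE % MOD;
    /* Subtract the oldest term; adding MOD keeps the value non-negative. */
    uint64_t without = (h + MOD - out * top % MOD) % MOD;
    return (uint32_t)((without * BASE + in) % MOD);
}
```

Here subtraction and reduction are ordinary integer operations (-, %). In the Rabin version the same two steps are still there, they just look different because the ring is different.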

The simplest ring is the integers mod 2, which has 0, 1 and defines

+  0 1        -  0 1        *  0 1
------        ------        ------
0  0 1        0  0 1        0  0 0
1  1 0        1  1 0        0  0 1  .

A very interesting thing happens: plus and minus have the same definition, and both are equivalent to XOR. Using a computer word to represent a polynomial with coefficients mod 2, we can add and subtract polynomials by using bitwise XOR. That's why XOR appears in rp->fingerprint ^ rp->U[om]: we're subtracting out the term from the byte that just left the window, using the U table since there are only 256 possibilities for that term.
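To make that concrete, here is a tiny sketch with two hypothetical helpers, gf2_add and gf2_sub (not part of librabinpoly), where bit i of a word is the coefficient of x^i:

```c
#include <stdint.h>

/* Polynomials over the integers mod 2, packed one coefficient per bit.
 * Adding coefficients mod 2 is XOR, and so is subtracting them. */
static uint64_t gf2_add(uint64_t a, uint64_t b) { return a ^ b; }
static uint64_t gf2_sub(uint64_t a, uint64_t b) { return a ^ b; }
```

For example, (x^3 + x + 1) + (x + 1) is gf2_add(0xB, 0x3), which gives 0x8, i.e. x^3: the x and 1 terms cancel because 1 + 1 = 0 mod 2. Subtracting 0x3 again gives back 0xB.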

The other use of XOR, ((p << 8) | m) ^ rp->T[p >> rp->shift], is in an expression that's modding by the irreducible polynomial, i.e., the equivalent of modding by m in the canonical explanation. If we were to do this by polynomial long division (how the T table gets computed in the first place, presumably), we'd notice that terms subtracted (in the ring) from the dividend are determined by the high-order bits alone (p >> rp->shift). A little algebraic manipulation later, we can cache the sum (in the ring) and subtract it (in the ring, so bitwise XOR) from the dividend (((p << 8) | m)).
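Here is a sketch of how such a table could be built by long division and then used. This is my reconstruction under simplifying assumptions, not librabinpoly's actual code: gf2_degree, gf2_mod, build_T and append8_sketch are hypothetical names, and the degree k is restricted to 8..56 so that (p << 8) | m always fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Degree of a polynomial = index of its highest set bit (-1 for zero). */
static int gf2_degree(uint64_t a) {
    int d = -1;
    while (a) { a >>= 1; d++; }
    return d;
}

/* Remainder of a modulo poly, by long division with coefficients mod 2. */
static uint64_t gf2_mod(uint64_t a, uint64_t poly) {
    int dp = gf2_degree(poly);
    assert(dp >= 0);
    for (int d = gf2_degree(a); d >= dp; d = gf2_degree(a))
        a ^= poly << (d - dp);        /* "subtract" a shifted copy of poly */
    return a;
}

/* T[j] caches everything long division would XOR into the dividend when
 * the 8 bits above position k-1 are j: it clears those high bits (j << k)
 * and adds their remainder modulo poly. */
static void build_T(uint64_t T[256], uint64_t poly) {
    int k = gf2_degree(poly);
    assert(k >= 8 && k <= 56);        /* restriction of this sketch only */
    for (uint64_t j = 0; j < 256; j++) {
        uint64_t high = j << k;
        T[j] = high ^ gf2_mod(high, poly);
    }
}

/* Same shape as append8 in the question; p must have degree < k. */
static uint64_t append8_sketch(const uint64_t T[256], int k,
                               uint64_t p, unsigned char m) {
    return ((p << 8) | m) ^ T[p >> (k - 8)];
}
```

With poly = 0x11B (x^8 + x^4 + x^3 + x + 1, used here only as a small example), append8_sketch(T, 8, p, m) agrees with gf2_mod((p << 8) | m, 0x11B) for every byte p and m. In this sketch k - 8 plays the role that rp->shift appears to play in the real code.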

For completeness, note that p << 8 is the equivalent of polynomial multiplication by x^8, and | m then adds the new byte as the low-order terms.
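Putting the pieces together, the following sketch checks that "subtract the leaving byte's term, multiply by x^8, add the new byte, reduce" really gives the fingerprint of the shifted window. It deliberately skips the tables and reduces one bit at a time. The names (append_byte, fingerprint, leaving_term, slide) and the toy parameters K, WINDOW and POLY are assumptions for illustration; the exact exponent the real U table uses depends on the library's window conventions.

```c
#include <stddef.h>
#include <stdint.h>

enum { K = 8, WINDOW = 4 };              /* toy degree and window size */
static const uint64_t POLY = 0x11B;      /* x^8 + x^4 + x^3 + x + 1 */

/* (f * x^8 + m) mod POLY, shifting in one bit of m at a time. */
static uint64_t append_byte(uint64_t f, unsigned char m) {
    for (int bit = 7; bit >= 0; bit--) {
        f = (f << 1) | ((m >> bit) & 1u);
        if (f >> K)                      /* degree reached K: subtract POLY */
            f ^= POLY;
    }
    return f;
}

/* Fingerprint of buf[0..n-1], computed from scratch. */
static uint64_t fingerprint(const unsigned char *buf, size_t n) {
    uint64_t f = 0;
    for (size_t i = 0; i < n; i++)
        f = append_byte(f, buf[i]);
    return f;
}

/* Term contributed by the oldest byte b: b * x^(8*(WINDOW-1)) mod POLY.
 * A U-style table would cache this for all 256 values of b. */
static uint64_t leaving_term(unsigned char b) {
    uint64_t f = append_byte(0, b);
    for (int i = 1; i < WINDOW; i++)
        f = append_byte(f, 0);
    return f;
}

/* One rolling step: XOR out the old byte's term, then append the new byte. */
static uint64_t slide(uint64_t f, unsigned char out, unsigned char in) {
    return append_byte(f ^ leaving_term(out), in);
}
```

Sliding a 4-byte window across a buffer with slide() yields, at every position, the same value as calling fingerprint() on those 4 bytes directly.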

David Eisenstat