0

I understand the algorithm behind creating signature matrix from shingles by applying hash functions. However I don't understand how to hash a specific band in a signature matrix to buckets. Assume in matrix M, band b1 we have following values for documents C1-C5:

C1  C2  C3  C4  C5
1   0   0   0   2
3   2   1   2   2
0   1   3   1   1

Just by looking at the values, we see C2 and C4 are identical in this band and they should hash into the same bucket. But the other columns will hash into different buckets.

My question is how to hash these columns into buckets and know if they are identical without actually comparing them. How should I define the hash function to map a column to a bucket? like sum of the elements in the column?

Mina
  • 45
  • 2
  • 6

1 Answers1

0

Trivial way, assume you have 1024 buckets

hash = 1 
for val  in column:
   hash = (( hash * 33 )  +  val ) % 1024

The % 1024 is necessarily the number of buckets you have. 33 is a magic number which works well in practice with strings

the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.

See http://www.cse.yorku.ca/~oz/hash.html

Kuzeko
  • 1,545
  • 16
  • 39
  • Would you like to give insights on why you chose these numbers 33 and 1024, and what you are trying to achieve by doing this? – ABuyukcakir Mar 18 '21 at 04:14