Is simhash function that reliable?

Question

I have been strugling with simhash algorithm for a while. I implemented it according to my understanding on my crawler. However, when I did some test, It seemed not so reliable to me.

I calculated fingerprint for 200.000 different text data and saw that, some different content had same fingerprints. So there are a big posibility of collision.

My implementation code is below.

My question is that: If My implementation is right, there is a big collision on this algorithm. How come google use this algorithm? Otherwise, what's the problem with my algorithm?

  public long  CalculateSimHash(string input)
        {
            var vector = GenerateVector(input);

            //5- Generate Fingerprint
            long fingerprint = 0;
            for (var i = 0; i < HashSize; i++)
            {
                if (vector[i] > 0)
                {
                    var zz = Convert.ToInt64(1 << i);
                    fingerprint += Math.Abs(zz);
                }
            }
            return fingerprint;
        }

 private int[] GenerateVector(string input)
        {
            //1- Tokenize input
            ITokeniser tokeniser = new OverlappingStringTokeniser(2, 1);
            var tokenizedValues = tokeniser.Tokenise(input);

            //2- Hash values
            var hashedValues = HashTokens(tokenizedValues);

            //3- Prepare vector
            var vector = new int[HashSize];
            for (var i = 0; i < HashSize; i++)
            {
                vector[i] = 0;
            }

            //4- Fill vector according to bitsetof hash
            foreach (var value in hashedValues)
            {
                for (var j = 0; j < HashSize; j++)
                {
                    if (IsBitSet(value, j))
                    {
                        vector[j] += 1;
                    }
                    else
                    {
                        vector[j] -= 1;
                    }
                }
            }
            return vector;

As I see, there is to much possibility to have birthday problem in simhash. My question is how come google trust and use it? @DanielA.White — mavera, Oct 27 '15 at 12:46

score 1 · Answer 1 · answered Oct 10 '17 at 04:50

I can see a couple of issues. First, you're only getting a 32-bit hash, not a 64-bit, because you're using the wrong types. See https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/left-shift-operator It's also best not to use a signed integer type here, to avoid confusion. So:

// Generate Fingerprint
ulong fingerprint = 0;
for (int i = 0; i < HashSize; i++)
{
    if (vector[i] > 0)
    {
        fingerprint += 1UL << i;
    }
}

Second issue is: I don't know how your OverlappingStringTokenizer works -- so I'm only guessing here -- but if your shingles (overlapping ngrams) are only 2 characters long, then a lot of these shingles will be found in a lot of documents. Chances are that two documents will share a lot of these features even if the purpose and meaning of the documents is quite different.

Because words are the smallest simple unit of meaning when dealing with text, I normally count my tokens in terms of words, not characters. Certainly 2 characters is far too small for an effective feature. I like to generate shingles from, say, 5 words, ignoring punctuation and whitespace.

Is simhash function that reliable?

1 Answers1