Questions tagged [simhash]

An algorithm for detecting similar documents by comparing their hash fingerprints.

simhash was developed by Moses Charikar. The algorithm is described in the paper.
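
As a rough illustration of the idea, here is a minimal sketch of a 64-bit simhash over whitespace-separated tokens. It is not taken from the paper or from any particular library: the names `simhash` and `hamming_distance`, the unweighted tokens, and the use of MD5 as the per-token hash are all assumptions made for this example.

```python
import hashlib

BITS = 64  # fingerprint width (assumption for this sketch)


def simhash(text):
    """Return a 64-bit fingerprint of the whitespace-separated tokens in text."""
    counts = [0] * BITS
    for token in text.lower().split():
        # Hash each token to a stable 64-bit integer.
        # MD5 is an arbitrary choice here; any well-mixed hash would do.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit of the fingerprint is set where the votes are positive.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts are then compared with `hamming_distance(simhash(a), simhash(b))`; texts that share most of their tokens tend to give a smaller distance than unrelated ones, and the cutoff for "similar" has to be tuned for the data at hand.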

21 questions
1
vote
1 answer

calculate pairwise simhash "distances"

I want to construct a pairwise distance matrix where the "distances" are the similarity scores between two strings as implemented here. I was thinking of using scikit-learn's pairwise distance method to do this, as I've used it before for other…
user139014
1
vote
2 answers

Similarity hash function (simhash)

I have a problem using a hash function. I have to assign a number (128-bit or 64-bit) to every word in the document, so the hash value of "similarity" must be near that of "similar". That means, if the hash value of similarity=>10022 (say) then…
MrYo
0
votes
0 answers

Check which string is approximately contained in the other string at scale

I have the following practical scenario. Imagine you have a column of strings; let's call them "description". And you have another column of (usually shorter) strings; let's call them "name". The task is to find which "name" is contained in the every…
Bociek
0
votes
1 answer

How to detect similar text in big data?

As far as I know, simhash and minhash can be used for this task. But all of those algorithms have to traverse the whole text database, which would be quite awful. Is there any optimization or other algorithm that can accelerate the task? All I come up…
Leo Zhao
0
votes
1 answer

Is the simhash function that reliable?

I have been struggling with the simhash algorithm for a while. I implemented it in my crawler according to my understanding. However, when I ran some tests, it did not seem very reliable to me. I calculated fingerprints for 200,000 different pieces of text data and saw…
mavera
0
votes
1 answer

Python simhash doesn't work on Ubuntu

I have the same setup and code for running simhash on a Mac, and it works. But when I run it on Ubuntu, it complains that the implementation of simhash itself has a bug. Have you encountered such a problem? objs = [(str(k), Simhash(v)) for k, v in…
Ben