I am looking for a node.js / Javascript module that applies the minhash algorithm to a string or bigger text, and returns me an "identifying" or "characteristic" Bytestring or Hexstring for that text. If I apply the algorithm to another similar text string, the hash string should also be similar. Does a module like that exist already?
The modules I was examining so far had only the possibility to compare texts directly and calculating some kind of jaccard similarity in numbers directly to the compared texts, but I would like to store some kind of hash-string for each document, so I can later compare the strings for similarity if I have similar texts...
Essentially, what I am looking for is this code from here (Java): in Javascript: https://github.com/codelibs/elasticsearch-minhash
for example, for a string like:
"The quick brown fox jumps over the lazy dog"
and "The quick brown fox jumps over the lazy d"
it would create a hash for the first sentence like:
"KV5rsUfZpcZdVojpG8mHLA=="
and for the second string something like:
KV5rsSfZpcGdVojpG8mGLA==
both hash-strings don't differ very much... that's the point in minhash algorithm, however, I don't know how to create that similar hashstring.. and all libraries thus far I have found, only compare directly 2 documents and create a similarity coefficient, but they don't create a hashstring that's characteristic for the document... The similarity with all algorithms is, that they create a hashed crc32 (or similar) hash value for their Array of word tokens (or shingles). But I still do not know how they compare those hashes with each other...