Is there a hash function for strings, such that strings within a small edit distance (for example, misspellings) would map to the same, or very close, hash values, while dissimilar strings would tend not to?
-
The magic Google words are "similarity preserving hashing". There are a bunch of such hash functions for different purposes, and they're not awesome, so there are always trade-offs. – Matt Timmermans Aug 25 '17 at 00:20
-
@MattTimmermans Isn't LSH the conventional name for these (both in the title and tag)? I just don't know of LSH for edit distances. – MWB Aug 25 '17 at 00:30
-
IIRC, Locality-sensitive hashing refers to mapping a vector space into a smaller dimensional space in a way that attempts to preserve nearness by a Euclidean or similar distance metric. – Matt Timmermans Aug 25 '17 at 00:44
1 Answer
One option is to compute the set of all k-mers (substrings of length k) of the string, hash each of them, and take the minimum hash value as the hash of the string.
This combines the idea of shingling with the idea of minhashing.
(Repeat this with several different hash functions to get better results, as usual with LSH schemes.)
This works because the probability that two strings have the same minhash equals the Jaccard similarity of their k-mer sets.
The similarity of the k-mer sets is related to the edit distance (but is not the same thing).
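A minimal sketch of the idea above, not a tuned implementation. The function names (`kmers`, `seeded_hash`, `minhash_signature`, `estimated_jaccard`) are made up for illustration, and using salted BLAKE2b from `hashlib` as the family of hash functions is an assumption; any family of reasonably independent hash functions should do.

```python
import hashlib


def kmers(s, k):
    """Return the set of all substrings of length k (k-mers, or shingles) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def seeded_hash(kmer, seed):
    """Hash one k-mer to a 64-bit integer; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")  # assumption: the salt selects the hash function
    digest = hashlib.blake2b(kmer.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(s, k=3, num_hashes=64):
    """For each hash function, keep the minimum hash over all k-mers of s."""
    grams = kmers(s, k)
    if not grams:
        raise ValueError("string is shorter than k")
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree; estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b, k=3):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)
```

With a single hash function (`num_hashes=1`) two similar strings either collide or they do not; with more hash functions the fraction of matching signature positions approaches the Jaccard similarity of the k-mer sets, so a misspelling should score much higher against the original than an unrelated string does.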

usamec
-
I don't think this would work: The hash would be determined by a **single** k-mer. Another highly similar string might be missing that k-mer, or have a mutation in it, and so the hashes of the two strings would be arbitrarily different. – MWB Aug 25 '17 at 10:06
-
That's why you have to use multiple hashes. A similar technique is used for finding similar sets (there, too, the minhash is a single element of the set). Also, the hash is not determined by a single k-mer in isolation, since all k-mers take part in deciding which one is minimal. This minimal k-mer approach is actually quite popular in bioinformatics and has been shown to work: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-3-r46 – usamec Aug 25 '17 at 14:58