Is there a hashing function that can be used in finding similar (not necessarily equal) strings?

Question

What I need is a hashing function that operates on fixed data sizes, obviously for non-security purposes. It needs to map similar strings to similar or equal hashes. In other words, small changes in a string should cause no change, or only a very small change, in its hash.

For example: "my name is John" and "my name is Jon" should have the same or very similar hashes; "my name is John" and "your name is Liam" should have somewhat similar hashes; "my name is John" and "I live in USA" should have totally different hashes; and so on.

Is there a hashing function for similar purposes?

There is no reliable way of achieving this (due to the pigeonhole principle, essentially). However, there is the concept of *fuzzy hashing*, which might get you part of the way there. — Oliver Charlesworth, Feb 14 '15 at 15:58
This is the first time I've heard of fuzzy hashing, after googling a bit I think this is the closest to what I'm looking for! Would you please post that as an answer? — mewais, Feb 14 '15 at 16:14

score 1 · Accepted Answer · answered Feb 14 '15 at 16:18

There is no reliable way of achieving this. This is due to the pigeonhole principle; there are far fewer ways that two short strings can be "close" than two long strings.

However, there is the concept of fuzzy hashing, which might get you part of the way there.
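To make the idea of a one-argument similarity-preserving hash concrete, here is a minimal sketch of SimHash, one locality-sensitive scheme. This is an assumption about what could fit the question, not necessarily what "fuzzy hashing" refers to in practice (tools such as ssdeep use a different, piecewise approach). The function names, the character trigram choice, the use of MD5 as the per-feature hash, and the 64-bit width are all illustrative choices.

```python
import hashlib

BITS = 64  # fingerprint width (illustrative choice)


def simhash(text, ngram=3):
    """Return a 64-bit SimHash fingerprint of `text`.

    Each character n-gram votes +1 or -1 on every bit position;
    the sign of the total decides the fingerprint bit.
    """
    counts = [0] * BITS
    # Strings shorter than `ngram` contribute themselves as a single feature.
    n_features = max(1, len(text) - ngram + 1)
    for i in range(n_features):
        gram = text[i:i + ngram]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:BITS // 8], "big")
        for bit in range(BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(BITS):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    base = simhash("my name is John")
    for other in ("my name is Jon", "your name is Liam", "I live in USA"):
        print(other, hamming_distance(base, simhash(other)))
```

Strings that share most of their trigrams tend to get fingerprints that differ in few bits, but this is only a statistical tendency: as noted above, no fixed-size hash can guarantee it for every pair of inputs.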

score 0 · Answer 2 · answered Feb 14 '15 at 16:02

It sounds like you're looking for Levenshtein distance (see http://en.wikipedia.org/wiki/Levenshtein_distance).

There are plenty of implementations of this in various languages.
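For reference, a minimal sketch of the Levenshtein distance in Python, using the standard dynamic-programming formulation with two rows. The function name is illustrative. Note that it takes two strings and returns a distance, so it compares a pair rather than hashing a single string.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("my name is John", "my name is Jon"))
```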

Robin Hyman

Possibly. But this is a distance (i.e. f(str1, str2)), not a hash (i.e. f(str)). – Oliver Charlesworth Feb 14 '15 at 16:03
Please don't only post link answers. Just put the essential parts of the link in your answer – Rizier123 Feb 14 '15 at 16:09

score 0 · Answer 3 · answered Feb 14 '15 at 20:54

I think the Jaccard index may be helpful in this case. The Jaccard index is a simple measure of how similar two sets are: it is the ratio of the size of the intersection of the sets to the size of their union.

There is a blog post discussing the Jaccard similarity index for measuring document similarity, which I found closer to your needs.
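A minimal sketch of the Jaccard index as defined above, applied to the example strings from the question. Splitting on whitespace and lowercasing are assumptions made for illustration (character n-grams would work as well), and treating two empty sets as identical is a convention chosen here, not part of the definition.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def word_jaccard(s1, s2):
    """Jaccard index over the lowercased, whitespace-separated words."""
    return jaccard(s1.lower().split(), s2.lower().split())


if __name__ == "__main__":
    base = "my name is John"
    for other in ("my name is Jon", "your name is Liam", "I live in USA"):
        print(other, word_jaccard(base, other))
```

With whole words as the set elements, "my name is John" and "your name is Liam" share two of six distinct words, giving an index of 1/3, while "my name is John" and "I live in USA" share none and score 0.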

Razib

Similar to my comment to the Levenshtein distance, this is still ultimately a two-argument distance-metric function, rather than a one-argument thing. – Oliver Charlesworth Feb 14 '15 at 21:10
