0

What I need is a hashing function that produces fixed-size output, obviously for non-security purposes. It needs to map similar strings to similar or equal hashes; in other words, a small change to a string should cause no change, or only a very small change, in its hash.

For example: "my name is John" and "my name is Jon" should have the same or very similar hashes. "my name is John" and "your name is Liam" should result in somewhat similar hashes. "my name is John" and "I live in USA" should give totally different hashes. And so on!

Is there a hashing function for similar purposes?

mewais
  • There is no reliable way of achieving this (due to the pigeonhole principle, essentially). However, there is the concept of *fuzzy hashing*, which might get you part of the way there. – Oliver Charlesworth Feb 14 '15 at 15:58
  • This is the first time I've heard of fuzzy hashing. After googling a bit, I think this is the closest to what I'm looking for! Would you please post that as an answer? – mewais Feb 14 '15 at 16:14

3 Answers

1

There is no reliable way of achieving this. This is due to the pigeonhole principle; there are far fewer ways that two short strings can be "close" than two long strings.

However, there is the concept of fuzzy hashing, which might get you part of the way there.
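
To make the idea a little more concrete, here is a minimal sketch of one related technique, SimHash, which is a locality-sensitive hash. This is an illustration chosen for brevity and not necessarily what tools usually called fuzzy hashers (such as ssdeep) do. The function names `simhash` and `hamming`, the character-trigram features, and the 64-bit width are all assumptions. Similar strings tend to produce fingerprints that differ in only a few bits, but, as said above, nothing guarantees it.

```python
import hashlib


def simhash(text: str, ngram: int = 3, bits: int = 64) -> int:
    """Return a `bits`-wide SimHash fingerprint built from character n-grams.

    Assumption: `bits` is a multiple of 8 and at most 128 (the MD5 digest size).
    """
    text = text.lower()
    # Character n-grams are the features; inputs shorter than `ngram`
    # contribute a single feature (the whole string).
    grams = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]

    weights = [0] * bits
    for gram in grams:
        # MD5 is used only as a stable, well-mixed feature hash, not for security.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1

    # A bit of the fingerprint is set where the positive votes win.
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two strings are then compared with `hamming(simhash(s1), simhash(s2))`: a smaller value suggests more similar inputs. Note that the comparison itself is still a separate step on top of the one-argument hash.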

Oliver Charlesworth
0

It sounds like you're looking for Levenshtein distance (see http://en.wikipedia.org/wiki/Levenshtein_distance).

There are plenty of implementations of this in various languages.
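
As a rough illustration, here is a minimal sketch of the standard dynamic-programming computation in Python (the function name `levenshtein` is just a label chosen here). Keep in mind that it takes two strings and returns a distance; it does not produce a per-string hash.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of `a` and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]
```

With the strings from the question, `levenshtein("my name is John", "my name is Jon")` is 1, since a single deletion is enough.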

0

I think in this case the Jaccard index may be helpful. The Jaccard index is a simple measure of how similar two sets are: it is the ratio of the size of the intersection of the sets to the size of their union.

There is a blog post discussing the Jaccard similarity index for measuring document similarity, which I found closer to your needs.
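
For illustration, here is a minimal sketch in Python. Treating each string as a set of lowercased, whitespace-separated words is an assumption made here (character n-grams would work as well), and the function name `jaccard` is only a label.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index of the word sets of two strings:
    |intersection| / |union|, from 0.0 (disjoint) to 1.0 (identical sets)."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    union = set_a | set_b
    if not union:
        # Assumption: two empty inputs are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)
```

On the examples from the question, "my name is John" versus "my name is Jon" share 3 of 5 distinct words (0.6), "my name is John" versus "your name is Liam" share 2 of 6 (about 0.33), and "my name is John" versus "I live in USA" share none (0.0).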

Razib
  • 3
    Similar to my comment to the Levenshtein distance, this is still ultimately a two-argument distance-metric function, rather than a one-argument thing. – Oliver Charlesworth Feb 14 '15 at 21:10