Levenshtein and hash - finding a hash algorithm whose outputs stay correlated (close in distance)

Question

I am looking for a hash-like algorithm that does not need to provide any security, but instead produces a fixed, characteristic pattern for a string, such that nearly identical strings can still be correlated using Levenshtein distance or some other distance metric.

Let's say I have two strings, "hello/friend/my?" and "hello/friend/my", and I calculate the Levenshtein distance between them with and without hashing, in Python:

>>> import Levenshtein as lev
>>> Str1 = "hello/friend/my?"
>>> Str2 = "hello/friend/my"
>>> # distance and ratio on the raw (lowercased) strings
>>> Distance = lev.distance(Str1.lower(), Str2.lower())
>>> print(Distance)
1
>>> Ratio = lev.ratio(Str1.lower(), Str2.lower())
>>> print(Ratio)
0.967741935483871

>>> # built-in hash(); for str it is salted per process unless PYTHONHASHSEED is set
>>> Str1hash = hash(Str1)
>>> Str2hash = hash(Str2)
>>> Distance = lev.distance(str(Str1hash), str(Str2hash))
>>> print(Distance)
16
>>> Ratio = lev.ratio(str(Str1hash), str(Str2hash))
>>> print(Ratio)
0.41025641025641024

As you can see, the raw strings are close to each other (distance 1), while their hashes are far apart (distance 16).

I would like to find a hash-like function or algorithm that yields a small distance and a high ratio for similar strings. Any ideas?

By definition, hashes do not fit your needs. Would treating your string as an integer and applying a modulo operation to it work for you? — Lou_is, Sep 12 '19 at 12:46
I wonder why you included the "cryptography" and "logic" tags. — Erwan Legrand, Sep 12 '19 at 16:21
In general, the nature of hashing is incompatible with what you're trying to do. But [Locality-sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) might help you out here. — Jim Mischel, Sep 13 '19 at 18:39
By the way "distinct" and "hash code" are generally incompatible. There is an essentially infinite number of possible strings. For any finite-length hash code, the [Pigeonhole principle](https://en.wikipedia.org/wiki/Pigeonhole_principle) applies: you *will* get collisions (two vastly different strings that hash to the same value). — Jim Mischel, Sep 13 '19 at 18:42

score 3 · Accepted Answer · answered Sep 20 '19 at 09:13

The solution I was looking for is locality-sensitive hashing (LSH): https://en.wikipedia.org/wiki/Locality-sensitive_hashing

It answers the question I posed. It is a technique used in information retrieval to find duplicate documents or web pages, so I can use it to compare my two strings and get their similarity index.
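
To make the idea concrete, here is a minimal sketch of MinHash, one member of the LSH family. It is not taken from the answer: the helper names (`shingles`, `minhash_signature`, `estimated_similarity`), the choice of character 3-grams, and the 64 seeded BLAKE2b hashes are all assumptions made for illustration. Note that MinHash estimates the Jaccard similarity of the two shingle sets, which tends to track edit distance for near-duplicates but is not the same measure as Levenshtein distance.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-character substrings of text (lowercased)."""
    text = text.lower()
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(text, num_hashes=64, k=3):
    """Build a MinHash signature: for each seeded hash, keep the minimum over all shingles."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        # Prefixing the seed simulates a family of independent hash functions (an assumption
        # of this sketch; real implementations often use random linear permutations instead).
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig1 = minhash_signature("hello/friend/my?")
sig2 = minhash_signature("hello/friend/my")
sig3 = minhash_signature("completely different text")

print(estimated_similarity(sig1, sig2))  # high: the strings share 13 of 14 distinct 3-grams
print(estimated_similarity(sig1, sig3))  # low: the strings share no 3-grams
```

Unlike the output of `hash()`, these signatures are meant to be compared position by position rather than with an edit distance. Raising `num_hashes` would tighten the estimate at the cost of a longer signature.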

score 0 · Answer 2 · answered Sep 13 '19 at 08:39

A hash function is, by design, supposed to place similar objects as far apart as possible, so what you're looking for does not exist. You could try some kind of simple character-substitution encoding, like ROT13, which might be the answer to your question, but please don't call it hashing =)

https://en.wikipedia.org/wiki/ROT13
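
A quick check of the ROT13 suggestion, as a sketch rather than a recommendation: because ROT13 replaces each character independently and reversibly, the encoded strings have the same edit distance as the originals. The `levenshtein` helper below is a plain dynamic-programming implementation written for this illustration so the example runs without third-party packages. The standard-library `rot_13` codec does the encoding.

```python
import codecs


def levenshtein(a, b):
    """Edit distance via dynamic programming, keeping only the previous row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


str1 = "hello/friend/my?"
str2 = "hello/friend/my"

# ROT13 rotates ASCII letters by 13 places and leaves other characters unchanged.
enc1 = codecs.encode(str1, "rot_13")
enc2 = codecs.encode(str2, "rot_13")

print(levenshtein(str1, str2), levenshtein(enc1, enc2))  # prints: 1 1
```

The output is as long as the input and trivially reversible, so this preserves distance but gives none of the fixed-size compression a hash provides.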

lenik

It does exist. After some research, I found this: https://en.wikipedia.org/wiki/Locality-sensitive_hashing – c1377554 Sep 20 '19 at 09:14
@c1377554 this is not exactly what you asked about, but if you're happy with it, congrats! =) – lenik Sep 20 '19 at 10:13
