Okay, this one is a bit of a doozy, but here goes...

I have computed perceptual hashes for a number of images, and I wish to count occurrences of near-duplicates.

The way this is currently done is by throwing every hash into a HashMap: if a hash already exists as a key, the associated value is incremented; otherwise it is added as a new key with a value of 1 (the values represent the observed counts).
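A minimal sketch of that counting step (class and method names are just for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Roughly how the counting is done today: each 64-bit perceptual hash is
// used directly as a key, so only exact duplicates are counted together.
public class ExactHashCounter {
    private final Map<Long, Integer> counts = new HashMap<>();

    public void add(long perceptualHash) {
        // increment if the key already exists, otherwise insert it with a count of 1
        counts.merge(perceptualHash, 1, Integer::sum);
    }

    public int countOf(long perceptualHash) {
        return counts.getOrDefault(perceptualHash, 0);
    }
}
```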

An issue with this approach is that some images are merely similar, and therefore do not produce the exact same hash, so they are not reflected in the counts as near-duplicates. This was expected.

(the defining property of such a hash function being that similar images produce similar, but not identical, hashes)

The most straightforward way of accomplishing this would, of course, be to compute the Hamming distance between each new input and every already existing key, returning the value for a key if one lies within the threshold, and otherwise using the input as a new unique key. (This is not what I'm looking for.)
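For clarity, a sketch of that linear-scan approach (names and the threshold handling are my own); it touches every existing key for every new hash, which is why I'd like to avoid it:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: reuse an existing key if it is within the Hamming
// distance threshold, otherwise insert the new hash as its own key.
public class LinearScanCounter {
    private final Map<Long, Integer> counts = new HashMap<>();
    private final int threshold; // maximum Hamming distance treated as "the same image"

    public LinearScanCounter(int threshold) {
        this.threshold = threshold;
    }

    public void add(long hash) {
        for (long key : counts.keySet()) {
            // Hamming distance between two 64-bit hashes: popcount of their XOR
            if (Long.bitCount(key ^ hash) <= threshold) {
                counts.merge(key, 1, Integer::sum);
                return;
            }
        }
        counts.put(hash, 1);
    }
}
```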

I am wondering if there is a way to design a locality-sensitive hash function for a HashMap such that inputs less than some specified Hamming distance apart produce the same output (an intentional hash collision)?
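To make the desired behaviour concrete, this is the property I'd want such a function `f` to satisfy (hypothetical names; this just states the contract, not an implementation):

```java
import java.util.function.LongUnaryOperator;

public class DesiredProperty {
    // For a chosen threshold d: whenever two 64-bit hashes are within d bits
    // of each other, f should map them to the same bucket (intentional collision).
    static boolean holdsFor(LongUnaryOperator f, long a, long b, int d) {
        int distance = Long.bitCount(a ^ b); // Hamming distance
        return distance > d || f.applyAsLong(a) == f.applyAsLong(b);
    }
}
```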

The inputs to this hash function would always be a guaranteed constant size of 64 bits (the perceptual hashes).

The specific perceptual hashing algorithm used is irrelevant (and may even change), but for the sake of simplicity let's assume it's AverageHash; the important part is that the input is always 64 bits.

I hope this question isn't too confusing, since it involves hashing of already-hashed values.

Since the core problem is about efficiently associating similar hashes, the answer does not necessarily have to answer the question as asked: if some other data structure or algorithm already exists for accomplishing the task, I'm happy to hear about those as well.

• You should look at ScaNN and Faiss, and perhaps vector databases in general. Efficient nearest neighbor lookup is non-trivial, so it's a bit more complicated than a variation of HashMap – Marat Jul 02 '23 at 03:05
• Also, project [Milvus](https://milvus.io/docs/index.md) offers some in-memory indexes which might be efficient at small scale – Marat Jul 02 '23 at 03:13

0 Answers