Modifying the hashCode() method in Java so that vectors whose Jaccard similarity is above a certain threshold get the same hash code, with good accuracy

example:

vector 1: [1,1,0,0,1,0] vector 2: [1,1,0,0,0,0]

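Under the usual set definition they have a Jaccard similarity of 2/3 ≈ 0.67 (2 positions where both are 1, out of 3 positions where at least one is 1), which is above a threshold of 0.5.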
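For reference, here is a minimal sketch of how that similarity can be computed for 0/1 vectors. It assumes the standard set definition (size of the intersection divided by size of the union of the positions holding 1), under which the pair above evaluates to 2/3. The class name `JaccardExample` and the convention for two all-zero vectors are assumptions made for illustration only.

```java
final class JaccardExample {

    // Jaccard similarity of two equal-length 0/1 vectors:
    // (positions where both are 1) / (positions where at least one is 1).
    static double jaccard(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        int both = 0;
        int either = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == 1 && b[i] == 1) both++;
            if (a[i] == 1 || b[i] == 1) either++;
        }
        // Assumed convention: two all-zero vectors are treated as identical.
        return either == 0 ? 1.0 : (double) both / either;
    }

    public static void main(String[] args) {
        int[] v1 = {1, 1, 0, 0, 1, 0};
        int[] v2 = {1, 1, 0, 0, 0, 0};
        // 2 shared ones out of 3 ones overall, i.e. roughly 0.667
        System.out.println(jaccard(v1, v2));
    }
}
```

Note that the function needs both vectors as input, which is exactly what `hashCode()` does not have available.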

How can I modify the hashCode() method in Java so that vectors with a similarity of 0.5 or above go into the same bucket, i.e. get the same hash code?

Note: I am not using the MinHash/LSH candidate-pair approach. The hash code has to be generated from the vector itself.

The goal is not to do it perfectly (which is impossible), but to do it as accurately as possible.

There will be situations where vectors A and B can go together, and B and C can go together, while A and C cannot. The hash function then has to map either A with B, or B with C, or all of A, B and C together.

dydy
  • How is jaccard similarity calculated? Calculate it, convert it to an int value (e.g. multiplying), return it. Note that `hashCode` can only be calculated for a single instance. The `hashCode` of vector1 cannot depend on the hashCode of vector2. What do you gain by putting those into the same bucket (of a set/map?)? – knittl Oct 24 '22 at 18:15
  • what would the *jaccard similarity* between `[1,1,0,0,0,0]` and `[1,0,0,0,0,0]` be? And between `[1,1,0,0,0,0]` and `[1,1,0,0,0,1]` ? – user16320675 Oct 24 '22 at 20:18

1 Answer

This is impossible. Jaccard similarity is calculated between two (or more) vectors, while a hash code must depend only on the contents of a single vector.

You can easily construct three vectors A, B and C such that (A,B) and (B,C) satisfy your criterion, meaning all three would have to generate the same hash code, but (A,C) does not.
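A minimal sketch of one such triple, assuming the standard set definition of Jaccard similarity and the 0.5 threshold from the question. The class name `NonTransitiveTriple`, the helper `similar`, and the specific vectors are hypothetical, chosen only to make the point concrete.

```java
final class NonTransitiveTriple {

    // Jaccard similarity of two equal-length 0/1 vectors.
    static double jaccard(int[] a, int[] b) {
        int both = 0;
        int either = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == 1 && b[i] == 1) both++;
            if (a[i] == 1 || b[i] == 1) either++;
        }
        // Assumed convention: two all-zero vectors are treated as identical.
        return either == 0 ? 1.0 : (double) both / either;
    }

    // True when the pair would be required to share a bucket.
    static boolean similar(int[] a, int[] b, double threshold) {
        return jaccard(a, b) >= threshold;
    }

    public static void main(String[] args) {
        int[] a = {1, 1, 0, 0};
        int[] b = {1, 1, 1, 1};
        int[] c = {0, 0, 1, 1};

        System.out.println(similar(a, b, 0.5)); // true:  2 shared / 4 overall
        System.out.println(similar(b, c, 0.5)); // true:  2 shared / 4 overall
        System.out.println(similar(a, c, 0.5)); // false: 0 shared / 4 overall
    }
}
```

Sharing a hash code is transitive: if A and B collide and B and C collide, then A and C collide too. The similarity relation above is not transitive, so no function of a single vector can reproduce it exactly.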

Jim Garrison
  • I would even go one step further (despite not sure how the *jaccard distance* is calculated): given any two vectors A and B, it is possible to find a sequence of vectors starting with A and ending with B such that the distance between any two successive pairs of vectors is 0.5 or above -> all vectors are in the same bucket (Had similar *discussion* these days on [Generate hashcode in java for similar strings based on hamming distance](https://stackoverflow.com/questions/74169949/generate-hashcode-in-java-for-similar-strings-based-on-hamming-distance#comment130952858_74169949) - now deleted) – user16320675 Oct 24 '22 at 20:30