0

I was hoping to create an easily memorable hash something like 3 random words (what3words), so the idea would be to hash a java object and the result was three random words.

Use case: I have objects with many fields and I need to compress the field into 24 chars (this is the size of a varchar primary key column in a database where these objects are stored, and cannot be changed), the resulting compressed value should also be easily memorable.

Initially, I decided to use 3 different hash functions (namely FNV1a64Hash, CRC32Hash and DJB2) to create 3 pre-hashes and then use those values as indexes into the dictionary however this resulted in a lot of collisions (Random Words tried: 10000000 No of collisions: 9272419). Note my dictionary size is approx 50k words.

Next, I decided to just call hashCode() on the object, then pad the resulting int and finally split it into 3 chunks of 5 digits, unfortunately again a lot of collisions occurred (Random Words tried: 10000000 No of collisions: 9999900). I think this in part could be down to the max size of an int being 2^31 which is only a 10 digit number so the first index was always 00000.

I've also used universal hashing but again I got a fair amount of collisions (Random Words tried: 10000000 No of collisions: 9996436)

I am wondering if I am missing something obvious here or if anyone knows of any well-known algorithms that might be able to help here? Apologies in advance for the noob question, this is the first time I've encountered hashing and there is a whole lot to learn.

I've pasted my code & test code below in case there is something obviously wrong with that.

public static String generate3Words1(Object obj) {
    BigInteger input = BigInteger.valueOf(obj.hashCode());
    int index1 = indexFor(CRC32Hash(input.toByteArray()));
    int index2 = indexFor(FNV1a64Hash(input.toByteArray()));
    int index3 = indexFor(DBJ2(input.toByteArray()));
    return dictionary.get(index1) + "-" + dictionary.get(index2) + "-" + dictionary.get(index3);
}


public String generate3Words2(Object obj) {
       int h = (h = obj.hashCode()) ^ (h >>> 16);
       String i = String.format("%015d", h);      
       String s = dictionary.get(indexFor(Integer.parseInt(i.substring(0, 5)))) + "-" + dictionary.get(indexFor(Integer.parseInt(i.substring(5, 10)))) + "-" + dictionary.get(indexFor(Integer.parseInt(i.substring(10, 15))));
       return s.length() > MAX_LEN ? s.substring(0, MAX_LEN) : s;
   }

private static int indexFor(long h) {
    return (int) (h & (ThreeWordHash.dictionary.size() - 1));
}

private static long FNV1a64Hash(byte[] data) {
    long hash = 0xcbf29ce484222325L;
    for (byte datum : data) {
        hash ^= (datum & 0xff);
        hash *= 1099511628211L;
    }
    return hash;
}

private static long CRC32Hash(byte[] data) {
    CRC32.reset();
    CRC32.update(data);
    return CRC32.getValue();
}


private static long DBJ2(byte[] data) {
    long hash = 5381;
    for (byte datum : data) {
        hash = ((hash << 5) + hash) + datum;
    }
    return hash;
}

private static String universalHashing(Object data) {
     int[] hashCodes = new int[NO_OF_WORDS];
     int hashCodeSizeDiff = WORD_SIZE - (WORD_SIZE / 2);
     int hstart = data.hashCode();
     int bmax = 1 << hashCodeSizeDiff;
     for (int i = 0; i < NO_OF_WORDS; i++) {
         hashCodes[i] = (((hstart * (i * 2 + 1)) + RAND.nextInt(bmax)) >> hashCodeSizeDiff) & (ThreeWordHash.dictionary.size() - 1);
     }
     String s = ThreeWordHash.dictionary.get(hashCodes[0]) + " " + ThreeWordHash.dictionary.get(hashCodes[1]) + " " + ThreeWordHash.dictionary.get(hashCodes[2]);
     return s.length() > MAX_LEN ? s.substring(0, MAX_LEN) : s;
 }

Test code:

@Test
void generate3Words() {
    List<String> words = new ArrayList<>(TestDictionary.WORDS);
    words.addAll(TestDictionary.WORDS);

    Random random = new Random(1);
    HashSet<String> seen1 = new HashSet<>();
    HashSet<String> seen2 = new HashSet<>();
    
    int count = 0;
    int noOfIterations = 10000000;
    
    //NOTE test dict size approx 4k words
    for (int j = 0; j < noOfIterations; j++) {
        String randomWord =  new StringBuilder()
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size())))
                .append(words.get(random.nextInt(TestDictionary.WORDS.size()))).toString();
      
        String res = ThreeWordHash.generate3Words(randomWord);
        
        if (seen1.contains(res) && !seen2.contains(randomWord)) {
            count++;
        }

        seen2.add(randomWord);
        seen1.add(res);
    }
    System.out.println("Random Words tried: " + seen2.size() + " No of collisions: " + count);
}
Henry Hargreaves
  • 285
  • 4
  • 18
  • What amount is "fair amount" of collisions? How many collisions per million? Considering you've tried several algorithms, it's more likely there's something wrong with your execution. Show the code to your object, its `hashCode()` and other relevant hashing functionality. – Kayaman Feb 23 '22 at 11:15
  • Hi, Just updated the question, the input object in this case will always be a string representation of the object. – Henry Hargreaves Feb 23 '22 at 11:33
  • You could check the distribution of your `randomWord.hashCode()` to see that it's not skewed and affect the rest of the code. You should also try with different number of iterations and plot them out, the amount of collisions obviously rises with iterations. – Kayaman Feb 23 '22 at 11:57
  • Don't use the `hashCode()` method if you need (almost) collision free hashes - `Object.hashCode()` was never meant to be collision free. For example, both strings "Ab" and "BC" have the same hash code (2113), and all your post-processing of that hash code cannot remove the collision. Instead of that you should start with a cryptographic hash function like SHA1 or MD5 over the original string – Thomas Kläger Feb 23 '22 at 12:29
  • There's a minor flaw in your 3hash method. It's taking 3 hashes of `hashCode()` of String instead of taking 3 hashes of String. Also your keyspace is only 4k³ = 64 billion which isn't very much. I recommend plotting different [iteration counts](https://en.wikipedia.org/wiki/Birthday_problem). – Kayaman Feb 23 '22 at 13:07

0 Answers0