4

I'm storing a large number of objects (with unique combinations of values stored in a byte array in the object) in a hashmap (~2.8million objects) and, when checking if I have any collision of hash code (32-bit hash), I'm very surprised to see there is none while statistically, I have nearly 100% chances of having at least one collision (cf. http://preshing.com/20110504/hash-collision-probabilities/).

I am thus wondering if my approach to detect collisions is bugged or if I'm extremely lucky...

Here is how I try to detect collisions from the 2.8million values stored in the map:

HashMap<ShowdownFreqKeysVO, Double> values;
(...fill with 2.8 mlns unique values...)
HashSet<Integer> hashes = new HashSet<>();
for (ShowdownFreqKeysVO key:values.keySet()){
    if (hashes.contains(key.hashCode())) throw new RuntimeException("Duplicate hash for:"+key);
    hashes.add(key.hashCode());
}

And here is the object's approach to create a hash value:

public class ShowdownFreqKeysVO {
    //Values for the different parameters
    public byte[] values = new byte[12];

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + Arrays.hashCode(values);
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        ShowdownFreqKeysVO other = (ShowdownFreqKeysVO) obj;
        if (!Arrays.equals(values, other.values))
            return false;
        return true;
    }
}

Any idea/hint on what I'm doing wrong would be greatly appreciated !

Thanks, Thomas

Tom
  • 1,375
  • 3
  • 24
  • 45
  • What does `hashes` contain after this line `HashSet hashes = new HashSet<>();` ? How are you populating values for `hashes` ? – Vikas V Dec 21 '13 at 14:38
  • 1
    He adds them with `hashes.add(key.hashCode());` in the loop. – meriton Dec 21 '13 at 14:41
  • on a side note, completely unrelated, if prime and result are set to a constant right before you do `result = prime * result + ...`, something looks wrong there. – mprivat Dec 21 '13 at 14:43
  • 1
    @mprivat That's just common boilerplate to implement hashCode. – Marko Topolnik Dec 21 '13 at 14:44
  • 2
    That looks right to me. Check that the set contains 2.8 million elements: maybe something is wrong in `(...fill with 2.8 mlns unique values...)`. or maybe you're lucky. I would also simply use `return Arrays.hashCode(values)`: adding 31 to this hashCode is unnecessary. – JB Nizet Dec 21 '13 at 14:48
  • You may for example be creating a map of just a handful of unique entries. – Marko Topolnik Dec 21 '13 at 14:51
  • Thanks all for your suggestions. I confirm that map contains 2.8mln entries (values.size() returns >2.8mln). Hashcode is indeed auto-generated by Eclipse: I'll remove the unnecessary constant part. Still hard for me to believe that I'm lucky and don't have any collision... or I should start playing at the lottery ! – Tom Dec 21 '13 at 14:53
  • Other side note: your byte array shouldn't be public. Especially in a class used as a key of a map, which should be immutable (as I already told you in a previous answer). – JB Nizet Dec 21 '13 at 14:55
  • How much is 2^(12*8)? I guess by magnitude more than 2.8 milions, right? If you fill randomly 12 bytes long array, it is really small chance to have a collision – Leos Literak Dec 21 '13 at 15:02
  • 2
    @LeosLiterak: You should read about the birthday paradox (http://en.wikipedia.org/wiki/Birthday_problem). There is 50% probability that two persons are born the same day of the year in a population of only 23 people (although 23 is much less than 365). And we're not talking about 2^(12*8) here, but about 2^32 (the number of possible int hashCodes) – JB Nizet Dec 21 '13 at 15:04
  • Indeed, if you read the link I posted in the question, you'll see that with just 77k values you already have a 50% chance of collision... – Tom Dec 21 '13 at 15:09
  • @Tom have you tried running it several times? – Octoshape Dec 21 '13 at 15:26

2 Answers2

5

I don't believe in luck

This is the implementation of Arrays.hashCode that you use

public static int hashCode(int a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (int element : a)
        result = 31 * result + element;

    return result;
}

If your values happen to be smaller then 31 they are treated like distinct numbers in the base 31, so each result in a different numbers (if we ignore overflows for now). Lets call those pure hashes

Now of course 31^11 is way larger then the number of integers in Java, so we will get tons of overflows. But since the powers of 31 and the maximum integer are "very different" you don't get a almost random distribution, but a very regular uniform distribution.

Lets consider a smaller example. I assume you have only 2 elements in your array and the range from 0 to 5 each. I try to create "hashCode" between 0 and 37 by taking the modulo 38 of the "pure hash" The result is that I get streaks of 5 integers with small gaps in between, and not a single collision.

val hashes = for {
  i <- 0 to 4
  j <- 0 to 4
} yield (i * 31 + j) % 38

println(hashes.size) // prints 25
println(hashes.toSet.size) // prints 25

To verify if this is what happens to your numbers you might create a graph as follows: For each hash take the first 16 bits for x and and the second 16 bits for y, color that dot black. I bet you will see an extremely regular pattern.

Jens Schauder
  • 77,657
  • 34
  • 181
  • 348
  • Thanks! Indeed, all values stored in the byte array have values lower than 31 (they range between -1 and 15) – Tom Dec 21 '13 at 15:59
0

I see nothing wrong with your code, but the analysis you link to assumes that hashCodes are uniformly distributed, and that the hashCodes of different objects are independent random variables.

The latter may not be true: You know that the objects are unique (and therefore not independent). Perhaps that particular brand of uniqueness is preserved by the hashCode function.

meriton
  • 68,356
  • 14
  • 108
  • 175