8

I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag). It may not contain null elements, although that's not crucial to my question. The equals contract defined by the interface is as you would expect:

  • obj instanceof MyInterface
  • obj contains the same elements as this (by equals)
  • obj contains the same number of duplicates for each element
  • order of elements is disregarded

Now I want to write my hashCode method. My initial idea was:

int hashCode = 1;
for( Object o : this ) {
    hashCode += o.hashCode();
}

However, I noticed that com.google.common.collect.Multiset (from Guava) defines the hash code as follows:

int hashCode = 0;
for( Object o : elementSet() ) {
    hashCode += ((o == null) ? 0 : o.hashCode()) ^ count(o);
}

It strikes me as odd that an empty Multiset would have hash code 0, but more importantly I don't understand the benefit of ^ count(o) over simply adding up the hash codes of every duplicate. Maybe it's about not calculating the same hash code more than once, but then why not * count(o)?
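For what it's worth, the difference between the two combining steps can be seen directly. A small sketch (the class and method names are just for illustration, not Guava API):

```java
public class XorVsMultiply {
    // Per-element term in Guava's style: elementHash ^ count.
    static int xorTerm(int hash, int count) {
        return hash ^ count;
    }

    // The alternative from the question: elementHash * count.
    static int mulTerm(int hash, int count) {
        return hash * count;
    }

    public static void main(String[] args) {
        int h = "foo".hashCode(); // any element hash; "foo" is arbitrary
        // With XOR, changing the count from 2 to 3 changes the term only in
        // the low bits: (h ^ 2) ^ (h ^ 3) == 2 ^ 3 == 1, independent of h.
        System.out.println(xorTerm(h, 2) ^ xorTerm(h, 3)); // prints 1
        // With multiplication, the same count change shifts the term by h,
        // spreading the difference over many more bits.
        System.out.println(mulTerm(h, 3) - mulTerm(h, 2)); // prints h itself
    }
}
```

Since counts are usually small, `^ count(o)` only perturbs the low bits of each element's hash, which is part of what makes the choice look questionable.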

My question: what would be an efficient hash code calculation? In my case the count for an element is not guaranteed to be cheap to obtain.

ColinD
  • 108,630
  • 30
  • 201
  • 202
Rinke
  • 6,095
  • 4
  • 38
  • 55
    The XOR '^' distributes values; addition doesn't so well. – Mitch Wheat Sep 16 '11 at 00:22
  • @Mitch: flesh that out a bit and post it as an answer. – Ken Bloom Sep 16 '11 at 00:28
  • @Mitch Wheat: I strongly disagree, the only difference between XOR and addition is the presence of carry. As the counts are usually small numbers it can't lead to good hashcodes. Something like `* (2*count(o)+1)` should work much better. PS: I haven't tried it out yet. – maaartinus Sep 16 '11 at 17:23
  • So I tried it out with a rather contrived example, in which Guava's hashCode performs quite badly (and my modified idea shines). See my answer (and don't take it too seriously). @Rinke: What about using `entrySet` instead? – maaartinus Sep 16 '11 at 18:26
  • @maaartinus: you can disagree all you want, but distribution of hash values is the primary reason XOR is used. – Mitch Wheat Sep 16 '11 at 23:58
  • @Mitch Wheat: You can use XOR as much as you want, but you can't expect good results (see my answer). – maaartinus Sep 17 '11 at 00:35
  • hash functions: http://eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx – Mitch Wheat Sep 17 '11 at 02:44
  • @maaartinus: Get your facts straight. I didn't say a simple XOR was a good hash. I said it was used over addition because it distributes values better than addition. – Mitch Wheat Sep 17 '11 at 02:46
  • @Mitch Wheat: So, do you think there's a better way to calculate the hash for a multiset than `hash += (2*hash+1) * element.hashCode()` for each element? – Rinke Sep 27 '11 at 08:38

4 Answers

2

If count is expensive, don't use it. But do you know it's actually too expensive? You can always code several implementations and profile their performance with data you expect to be representative of your application. Then you'll know the answer instead of guessing.

As for why you use XOR, see 'Calculating Aggregate hashCodes with XOR'.

Nick Veys
  • 23,458
  • 4
  • 47
  • 64
2

It strikes me as odd that an empty Multiset would have hash code 0

Why? All empty collections probably have hash code 0. Even if not, it would have to be a fixed value (since all empty collections are equal), so what is wrong with 0?

what would be an efficient hash code calculation?

Yours is more efficient (which just means faster to calculate), and not that bad in terms of effectiveness (which means yielding results that work well), either. If I understand it correctly, it adds up the hash codes of all elements (with duplicate elements being added twice, three times, and so on). This is exactly what a regular Set does, so if you have no duplicates, you get the same hashCode as with a Set, which could be an advantage (if you fix the empty multiset to have hashCode 0, not 1).

Google's version is a little more complicated, I suppose in order to avoid some otherwise frequent collisions. Of course it probably causes some other collisions that are considered less frequent to happen instead.

In particular, using XOR spreads the hashCodes all over the available range, even if the individual input hashCodes do not (which they for example do not for Integers from a limited range, which is a frequent use-case).

Consider the hashCode for the Set [1, 2, 3]. It is 6, and likely to collide with similar Sets, for example [6], [4, 2], [5, 1]. Throwing some XOR in there helps. Whether that is necessary and worth the extra cost is a tradeoff you have to make.
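Those collisions are easy to check with `java.util.Set` itself, whose hashCode is specified as the sum of the element hash codes (and an Integer's hash code is its value):

```java
import java.util.Set;

public class SetCollisions {
    public static void main(String[] args) {
        // Set.hashCode() is the sum of the element hash codes, and
        // Integer.hashCode(i) == i, so all four sets collide on 6.
        // (Set.of requires Java 9+.)
        System.out.println(Set.of(1, 2, 3).hashCode()); // 6
        System.out.println(Set.of(6).hashCode());       // 6
        System.out.println(Set.of(4, 2).hashCode());    // 6
        System.out.println(Set.of(5, 1).hashCode());    // 6
    }
}
```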

Thilo
  • 257,207
  • 101
  • 511
  • 656
  • The 0 struck me as odd because I thought it's good practice to reserve it for hashing nulls, that's all. Of course there's nothing wrong with it. – Rinke Sep 16 '11 at 09:35
  • You understood my hash function perfectly. And if I understand you correctly, it would (or could) lead to more collisions compared to the XOR method. I'm afraid I don't really understand why. (Doesn't HashMap have optimizations for this?) I also don't understand why adding would be more efficient in general. In my particular case it probably is, but if getting count is cheap I'd say the XOR method is more efficient. – Rinke Sep 16 '11 at 09:42
  • Sorry for the down vote, but not all empty collections have a hashCode of zero, not all empty collections are equal, and the Collection interface does not add any clauses to the Object equals contract. The List and Set interfaces do, but a list and a set can never be equal, so an empty list and an empty set are not equal. – Steve Lancashire Apr 17 '12 at 12:24
2

Update

Let's take the example from the comments: an array that we want to treat as a multiset.

So you have to process all the entries as they come, you can't use count, and can't assume the entries come in a known order.

The general function I'd consider is

int hashCode() {
    int x = INITIAL_VALUE;
    for (Object o : this) {
        x = f(x, o==null ? NULL_HASH : g(o.hashCode()));
    }
    return h(x);
}

Some observations:

  • As already stated in the other answers, the INITIAL_VALUE doesn't matter much.
  • I wouldn't go for NULL_HASH=0 since this would ignore null values.
  • The function g can be used in case you expect the hashes of the members to be in a small range (which can happen in case they are e.g., single characters).
  • The function h can be used to improve the result, which is not very important since this already happens e.g. in HashMap.hash(int).
  • The function f is the most important one, unfortunately, it's quite limited as it obviously must be both associative and commutative.
  • The function f should be bijective in both arguments, otherwise you'd generate unnecessary collisions.

In no case would I recommend f(x, y) = x ^ y, since it would make two occurrences of an element cancel out. Using addition is better. Something like

f(x, y) = x + (2*A*x + 1) * y

where A is a constant, satisfies all the above conditions. For A=0 it degenerates to addition. Using an even A is not good, as it shifts bits of x*y out; using A=1 is fine, and the expression 2*x+1 can be computed with a single instruction on the x86 architecture. A larger odd A might work better in case the hashes of the members are badly distributed.

In case you go for a non-trivial hashCode(), you should test if it works correctly. You should also measure the performance of your program; maybe you'll find simple addition sufficient. Otherwise, I'd go for NULL_HASH=1, g=h=identity, and A=1.
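A minimal sketch of that recommendation (NULL_HASH=1, g=h=identity, A=1). The class name and sample elements are made up for illustration:

```java
import java.util.List;

public class MultisetHash {
    static final int A = 1; // odd multiplier constant; larger odd values also work

    // f(x, y) = x + (2*A*x + 1) * y. Expanded, f(x, y) = x + y + 2*A*x*y,
    // which is commutative and associative (mod 2^32), so the result
    // does not depend on iteration order.
    static int f(int x, int y) {
        return x + (2 * A * x + 1) * y;
    }

    static int hashCode(Iterable<?> elements) {
        int x = 1; // INITIAL_VALUE; as noted above, it doesn't matter much
        for (Object o : elements) {
            x = f(x, o == null ? 1 : o.hashCode()); // NULL_HASH = 1
        }
        return x; // h = identity
    }

    public static void main(String[] args) {
        List<String> a = List.of("a", "b", "b", "c");
        List<String> b = List.of("c", "b", "a", "b"); // same multiset, other order
        System.out.println(hashCode(a) == hashCode(b)); // true: order-independent
        // A different count for "b" yields a different hash:
        System.out.println(hashCode(a) == hashCode(List.of("a", "b", "c"))); // false
    }
}
```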

My old answer

It may be for efficiency reasons. Calling count may be expensive for some implementations, but entrySet may be used instead. Still, it might be more costly; I can't tell.

I did a simple collision benchmark for Guava's hashCode and Rinke's and my own proposals:

enum HashCodeMethod {
    GUAVA {
        @Override
        public int hashCode(Multiset<?> multiset) {
            return multiset.hashCode();
        }
    },
    RINKE {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Object o : multiset.elementSet()) {
                result += (o==null ? 0 : o.hashCode()) * multiset.count(o);
            }
            return result;
        }
    },
    MAAARTIN {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Multiset.Entry<?> e : multiset.entrySet()) {
                result += (e.getElement()==null ? 0 : e.getElement().hashCode()) * (2*e.getCount()+123);
            }
            return result;
        }
    }
    ;
    public abstract int hashCode(Multiset<?> multiset);
}

The collision counting code went as follows:

private void countCollisions() throws Exception {
    final String letters1 = "abcdefgh";
    final String letters2 = "ABCDEFGH";
    final int total = letters1.length() * letters2.length();
    for (final HashCodeMethod hcm : HashCodeMethod.values()) {
        final Multiset<Integer> histogram = HashMultiset.create();
        for (final String s1 : Splitter.fixedLength(1).split(letters1)) {
            for (final String s2 : Splitter.fixedLength(1).split(letters2)) {
                histogram.add(hcm.hashCode(ImmutableMultiset.of(s1, s2, s2)));
            }
        }
        System.out.println("Collisions " + hcm + ": " + (total-histogram.elementSet().size()));
    }
}

and printed

Collisions GUAVA: 45
Collisions RINKE: 42
Collisions MAAARTIN: 0

So in this simple example Guava's hashCode performed really badly (45 collisions out of 63 possible). However, I don't claim my example is of much relevance for real life.

maaartinus
  • 44,714
  • 32
  • 161
  • 320
  • Interesting answer. How did you come up with your hash function? It seems to works great, but what's the math behind it? What would you suggest if getting a count is expensive? (Let's say, for example, in the case where we'd have an array that we want to treat as a multiset.) – Rinke Sep 17 '11 at 17:20
  • The math is just a simple observation: XOR and ADD are quite similar, multiplication makes it better. However, computing `x*y` isn't good for mixing `x` and `y`. Something like `(2*x+1)*(2*xy+1)` works well in general, using bigger constants (in place of the 1) may prevent some collisions nearly for free. – maaartinus Sep 17 '11 at 21:12
  • In my above comment it should read `(2*x+1)*(2*y+1)`, but it's not really good, as the result is always odd. However, the multiplication by an odd number is the key here. Using `x+(2*x+1)*y` which is the same as `2*x*y+x+y+1` is fine, multiplying a term by an odd number makes it better in case of small `x` and `y`. – maaartinus Sep 18 '11 at 09:49
1

I observe that java.util.Map uses more or less the same logic: java.util.Map.hashCode() is specified to return map.entrySet().hashCode(), and Map.Entry specifies that its hashCode() is entry.getKey().hashCode() ^ entry.getValue().hashCode(). Accepting the analogy from Multiset to Map, this is exactly the hashCode implementation you'd expect.
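The analogy is easy to verify: viewing a multiset such as {a: 2, b: 1} as a Map from element to count, the specified Map hashCode (sum over entries of keyHash ^ valueHash) is exactly Guava's sum of elementHash ^ count. The sketch below uses `Map.of` (Java 9+); the elements and counts are arbitrary:

```java
import java.util.Map;

public class MapEntryHash {
    public static void main(String[] args) {
        // Map.hashCode() is specified as the sum over entries of
        // key.hashCode() ^ value.hashCode(), i.e. elementHash ^ count
        // when the map represents the multiset {a: 2, b: 1}.
        Map<String, Integer> counts = Map.of("a", 2, "b", 1);
        int expected = ("a".hashCode() ^ Integer.hashCode(2))
                     + ("b".hashCode() ^ Integer.hashCode(1));
        System.out.println(counts.hashCode() == expected); // true
    }
}
```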

Louis Wasserman
  • 191,574
  • 25
  • 345
  • 413