Update
Let's take, for example, the case where we have an array that we want to treat as a multiset. You then have to process all the entries as they come: you can't use count, and you can't assume the entries arrive in any known order. The general function I'd consider is
int hashCode() {
    int x = INITIAL_VALUE;
    for (Object o : this) {
        x = f(x, o == null ? NULL_HASH : g(o.hashCode()));
    }
    return h(x);
}
Some observations:
- As already stated in the other answers, the INITIAL_VALUE doesn't matter much.
- I wouldn't go for NULL_HASH=0, since this would ignore null values.
- The function g can be used in case you expect the hashes of the members to be in a small range (which can happen when they are, e.g., single characters).
- The function h can be used to improve the result; this is not very important, since it already happens, e.g., in HashMap.hash(int).
- The function f is the most important one; unfortunately, it's quite limited, as it obviously must be both associative and commutative.
- The function f should be bijective in both arguments, otherwise you'd generate unnecessary collisions.
In no case would I recommend f(x, y) = x^y, since it would make two occurrences of an element cancel out. Using addition is better. Something like f(x, y) = x + (2*A*x + 1) * y, where A is a constant, satisfies all the above conditions and may be worth using. For A=0 it degenerates to addition; using an even A is not good, as it shifts bits of x*y out. Using A=1 is fine, and the expression 2*x+1 can be computed using a single instruction on the x86 architecture. A larger odd A might work better in case the hashes of the members are badly distributed.
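To see why such an f keeps the result independent of the element order, expand two consecutive steps: f(f(x, a), b) = x*(2*A*a + 1)*(2*A*b + 1) + a + b + 2*A*a*b, which is symmetric in a and b (and this still holds under int overflow, since it's a ring identity). A small stand-alone check for A=1, just to illustrate the point, could look like this:

class OrderIndependenceCheck {
    // f(x, y) = x + (2*A*x + 1) * y with A = 1
    static int f(int x, int y) {
        return x + (2 * x + 1) * y;
    }

    public static void main(String[] args) {
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            int x = rnd.nextInt(), a = rnd.nextInt(), b = rnd.nextInt();
            // folding in a then b must equal folding in b then a
            if (f(f(x, a), b) != f(f(x, b), a)) {
                throw new AssertionError("order matters for x=" + x + ", a=" + a + ", b=" + b);
            }
        }
        System.out.println("no order dependence found");
    }
}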
In case you go for a non-trivial hashCode(), you should test that it works correctly. You should also measure the performance of your program; maybe you'll find simple addition sufficient. Otherwise, I'd go for NULL_HASH=1, g=h=identity, and A=1, which could look like the sketch below.
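As an illustration only (the array-backed class and its field name are made up for the sketch, and equals is omitted for brevity), plugging those choices into the template above gives something like:

class ArrayBackedMultiset {
    private final Object[] elements;   // hypothetical backing array, treated as a multiset

    ArrayBackedMultiset(Object... elements) {
        this.elements = elements;
    }

    @Override
    public int hashCode() {
        int x = 1;                                   // INITIAL_VALUE; the exact value doesn't matter much
        for (Object o : elements) {
            int y = (o == null) ? 1 : o.hashCode();  // NULL_HASH = 1, g = identity
            x = x + (2 * x + 1) * y;                 // f(x, y) with A = 1
        }
        return x;                                    // h = identity
    }
}

Since f is order-independent, two arrays containing the same elements with the same multiplicities hash to the same value.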
My old answer
It may be for efficiency reasons. Calling count may be expensive for some implementations, but entrySet may be used instead. Still, it might be more costly; I can't tell.
I did a simple collision benchmark comparing Guava's hashCode, Rinke's proposal, and my own:
enum HashCodeMethod {
    GUAVA {
        @Override
        public int hashCode(Multiset<?> multiset) {
            return multiset.hashCode();
        }
    },
    RINKE {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Object o : multiset.elementSet()) {
                result += (o == null ? 0 : o.hashCode()) * multiset.count(o);
            }
            return result;
        }
    },
    MAAARTIN {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Multiset.Entry<?> e : multiset.entrySet()) {
                result += (e.getElement() == null ? 0 : e.getElement().hashCode()) * (2 * e.getCount() + 123);
            }
            return result;
        }
    };

    public abstract int hashCode(Multiset<?> multiset);
}
The collision counting code went as follows:
private void countCollisions() throws Exception {
    final String letters1 = "abcdefgh";
    final String letters2 = "ABCDEFGH";
    final int total = letters1.length() * letters2.length();
    for (final HashCodeMethod hcm : HashCodeMethod.values()) {
        final Multiset<Integer> histogram = HashMultiset.create();
        for (final String s1 : Splitter.fixedLength(1).split(letters1)) {
            for (final String s2 : Splitter.fixedLength(1).split(letters2)) {
                histogram.add(hcm.hashCode(ImmutableMultiset.of(s1, s2, s2)));
            }
        }
        System.out.println("Collisions " + hcm + ": " + (total - histogram.elementSet().size()));
    }
}
and printed
Collisions GUAVA: 45
Collisions RINKE: 42
Collisions MAAARTIN: 0
So in this simple example, Guava's hashCode performed really badly (45 collisions out of 63 possible). However, I don't claim my example is of much relevance for real life.