Is it a good approach to generate hash codes?

Question

I have to write a hash function, under the following two conditions:

I don't know anything about the Object o that is passed to the method: it can be a String, an Integer, or an actual custom object;
I am not allowed to call hashCode() at all.

The approach I am using now to calculate the hash code:

Write the object to a byte stream;
Convert the byte stream to a byte array;
Loop through the byte array and calculate the hash by doing something like this:

hash = hash * PRIME + byteArray[i]
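
The three steps above could be sketched roughly as follows. This is only an illustration: the class name `SerializedHash` and the value of `PRIME` are assumptions, and the sketch only works for objects that are serializable.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

final class SerializedHash {
    // Assumed multiplier; the question only calls it PRIME.
    private static final int PRIME = 31;

    static int hash(Object o) {
        // Steps 1 and 2: serialize the object and collect the bytes in an array.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(o);
        } catch (IOException e) {
            // Also covers NotSerializableException for non-serializable objects.
            throw new UncheckedIOException(e);
        }
        byte[] byteArray = buffer.toByteArray();

        // Step 3: fold the bytes into an int; overflow simply wraps around.
        int hash = 0;
        for (byte b : byteArray) {
            hash = hash * PRIME + b;
        }
        return hash;
    }
}
```

Calling `SerializedHash.hash("abc")` twice gives the same value, while an object that does not implement Serializable makes the call fail with an exception.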

My question: is this a passable approach, and is there a way to improve it? Personally, I feel the scope of this function is too broad, since there is no information about what the objects are, but I have little say in this situation.

This sounds like homework (*why* aren't you allowed to call `hashCode`?). If it is, please tag it so that you'll get different (and *better*) answers. — phihag, Jul 08 '11 at 13:42
@Nikita As you don't know specifics about any object, I think you are doing it the best possible way. — Marcelo, Jul 08 '11 at 13:48
@Nikita I assume that System.identityHashCode(o) is out of the question (it would certainly be faster than what you are doing)? There are other things you could do... like o.getClass().getDeclaredFields() and hash all those or all their values together... — Matt Wonlaw, Jul 08 '11 at 13:55
@mlaw, my specs are extremely scarce, but I would assume it's out of the question because it will return the same thing hashCode will (none of the objects that I need to support is going to provide its own implementation of hashCode()). — Nikita, Jul 08 '11 at 14:01
@mlaw. The reflection part is essentially what `HashCodeBuilder.reflectionHashCode` does — Kaj, Jul 08 '11 at 14:20

score 3 · Answer 1 · answered Jul 08 '11 at 13:56

You could use HashCodeBuilder.reflectionHashCode instead of implementing your own solution.
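
If a library is not an option, the reflection idea can be sketched by hand without calling hashCode(). Everything below is an assumption made for illustration: the class `ReflectiveHash`, the example type `Point`, the multiplier, and the way primitives are folded. It does not handle arrays, cyclic object graphs, or JDK-internal classes whose fields cannot be made accessible, and the order returned by getDeclaredFields() is not guaranteed by the specification.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

final class ReflectiveHash {
    private static final int PRIME = 31; // assumed multiplier

    /** Hypothetical example type, only here to exercise the sketch. */
    static final class Point {
        private final int x;
        private final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static int hash(Object o) {
        if (o == null) {
            return 0;
        }
        if (o instanceof CharSequence) {
            // Hash character content directly instead of reflecting into String.
            CharSequence s = (CharSequence) o;
            int h = 0;
            for (int i = 0; i < s.length(); i++) {
                h = h * PRIME + s.charAt(i);
            }
            return h;
        }
        if (o instanceof Float || o instanceof Double) {
            return fold(Double.doubleToLongBits(((Number) o).doubleValue()));
        }
        if (o instanceof Number) {
            return fold(((Number) o).longValue());
        }
        if (o instanceof Boolean) {
            return ((Boolean) o) ? 1 : 2;
        }
        if (o instanceof Character) {
            return (Character) o;
        }
        // Anything else: combine the instance fields of the class and its superclasses.
        int h = 0;
        for (Class<?> c = o.getClass(); c != Object.class; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers())) {
                    continue;
                }
                try {
                    f.setAccessible(true);
                    h = h * PRIME + hash(f.get(o)); // primitives arrive boxed
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return h;
    }

    // Folds a 64-bit value into 32 bits.
    private static int fold(long v) {
        return (int) (v ^ (v >>> 32));
    }
}
```

Two `Point` instances with the same coordinates hash to the same value here even though `Point` does not override equals() or hashCode().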

Kaj

I _must_ implement my own solution, this is the point of the question :) – Nikita Jul 08 '11 at 15:45

Paŭlo Ebermann · Answer 2 · 2011-07-08T14:15:46.327

The serialization approach only works for objects that are in fact serializable, so it cannot cover all types of objects.

Also, it compares objects by whether they have equivalent object graphs, which is not necessarily the same as being equal by .equals().

For example, StringBuilder objects created by the same code (with the same data) will have equal ObjectOutputStream output (and thus an equal hash) while b1.equals(b2) is false, and an ArrayList and a LinkedList with the same elements will register as different while list1.equals(list2) is true.
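
A small sketch to check both cases. The class name `SerializationVsEquals` and the helper `serialize` are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

final class SerializationVsEquals {
    // Returns the serialized form of o as a byte array.
    static byte[] serialize(Object o) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder b1 = new StringBuilder("abc");
        StringBuilder b2 = new StringBuilder("abc");
        // Same serialized bytes, but StringBuilder does not override equals().
        System.out.println(Arrays.equals(serialize(b1), serialize(b2))); // true
        System.out.println(b1.equals(b2));                               // false

        List<String> list1 = new ArrayList<>(Arrays.asList("a", "b"));
        List<String> list2 = new LinkedList<>(Arrays.asList("a", "b"));
        // Different serialized bytes (different classes), but equal as lists.
        System.out.println(Arrays.equals(serialize(list1), serialize(list2))); // false
        System.out.println(list1.equals(list2));                               // true
    }
}
```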

You can avoid the step of converting the byte stream to an array by creating a custom HashOutputStream, which simply takes the byte data and hashes it instead of saving it in an array for later iteration.

import java.io.OutputStream;

class HashOutputStream extends OutputStream {

    private static final int PRIME = 13;
    private int hash;

    // All the other write methods inherited from OutputStream delegate to this one.
    @Override
    public void write(int b) {
        this.hash = this.hash * PRIME + b;
    }

    // The hash of all bytes written so far.
    public int getHash() {
        return hash;
    }
}

Then wrap your ObjectOutputStream around an object of this class.
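
A minimal sketch of that wrapping. The stream class is repeated as a nested class so the example compiles on its own; the outer class `StreamingHash` and its `hash` method are names invented here.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class StreamingHash {
    /** Same idea as the HashOutputStream above, repeated so the sketch is self-contained. */
    static final class HashOutputStream extends OutputStream {
        private static final int PRIME = 13;
        private int hash;

        @Override
        public void write(int b) {
            hash = hash * PRIME + b;
        }

        int getHash() {
            return hash;
        }
    }

    static int hash(Object o) {
        HashOutputStream sink = new HashOutputStream();
        // Closing the ObjectOutputStream flushes any buffered bytes into the sink.
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.getHash();
    }
}
```

No intermediate byte array is built; each byte is folded into the hash as the ObjectOutputStream emits it.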

Instead of your hash = hash * PRIME + byte method, you might look at other checksum algorithms. For example, java.util.zip contains Adler32 (used in the zlib format) and CRC32 (used in the gzip format).
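
One way to plug these in, sketched with java.util.zip.CheckedOutputStream, which updates a Checksum with every byte written through it. The class `ChecksumHash`, its methods, and the discarding sink are assumptions of this sketch.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;
import java.util.zip.Checksum;

final class ChecksumHash {
    /** Discards all data; only the checksum side effect matters. */
    private static final class NullOutputStream extends OutputStream {
        @Override
        public void write(int b) { }

        @Override
        public void write(byte[] b, int off, int len) { }
    }

    static int hash(Object o, Checksum checksum) {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new CheckedOutputStream(new NullOutputStream(), checksum))) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Both CRC32 and Adler32 produce 32-bit values, so the cast loses nothing.
        return (int) checksum.getValue();
    }

    static int crc32(Object o) {
        return hash(o, new CRC32());
    }

    static int adler32(Object o) {
        return hash(o, new Adler32());
    }
}
```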

To handle the problem with not all objects being serializable, the object could be converted to a String using toString(). It looks like the object is being passed into the hashCode method, so this is not Object's hashCode method, so the equals consistency does not have to be maintained. — Michael Krussel, Jul 08 '11 at 18:54

score 0 · Answer 3 · answered Jul 08 '11 at 13:45

hash = (hash * PRIME + byteArray[i]) % MODULO ?

lacungus

Not required - There is an implicit modulo of 2^64. – phihag Jul 08 '11 at 13:46
@mlaw Depends on the internal representation. But you're right, most likely `int` all the way. – phihag Jul 08 '11 at 13:57
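
To make the implicit modulo concrete: assuming `hash` is a Java int, the arithmetic wraps modulo 2^32 (with a long it would be 2^64). The sketch below compares the int loop with the same recurrence done in arbitrary precision and truncated at the end; the class `WrapAround` and its methods are invented for this illustration. An explicit modulo is only needed when mapping the result to a bucket index.

```java
import java.math.BigInteger;

final class WrapAround {
    private static final int PRIME = 13;

    // The hash from the question, using plain int arithmetic.
    static int intHash(byte[] data) {
        int hash = 0;
        for (byte b : data) {
            hash = hash * PRIME + b; // overflow wraps silently
        }
        return hash;
    }

    // Same recurrence without overflow, reduced to the low 32 bits at the end.
    static int bigHash(byte[] data) {
        BigInteger hash = BigInteger.ZERO;
        for (byte b : data) {
            hash = hash.multiply(BigInteger.valueOf(PRIME)).add(BigInteger.valueOf(b));
        }
        return hash.intValue(); // keeps only the low-order 32 bits
    }

    // Maps a possibly negative hash to a bucket index in [0, buckets).
    static int bucket(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }
}
```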

phihag · Answer 4 · 2011-07-08T14:35:38.287

Also, while you're at it: if you want to avoid collisions as much as possible, you can use a standardized hash function in step 3 (a cryptographic one such as SHA-2 if intentional collisions are an issue).

Have a look at DigestInputStream (or DigestOutputStream, its counterpart for the writing side), which also spares you step 2.
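
A rough sketch of this idea. Since the object is written rather than read, it uses java.security.DigestOutputStream, the output-side counterpart of DigestInputStream. The class `DigestHash`, the choice of SHA-256, and truncating the digest to its first four bytes are assumptions of this example.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestHash {
    static int hash(Object o) {
        MessageDigest sha256;
        try {
            sha256 = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }

        // The sink discards the bytes; the digest sees them on the way through.
        OutputStream sink = new OutputStream() {
            @Override
            public void write(int b) { }
        };
        try (ObjectOutputStream out =
                new ObjectOutputStream(new DigestOutputStream(sink, sha256))) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        // Truncate the 32-byte digest to an int by taking its first four bytes.
        return ByteBuffer.wrap(sha256.digest()).getInt();
    }
}
```

Truncating to 32 bits gives up most of the collision resistance of the full digest, so this mainly buys a well-mixed hash value, at a higher cost per byte than a simple multiplicative hash.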

edited Jul 08 '11 at 14:35

answered Jul 08 '11 at 13:45

fixed, "convert byte stream to byte array" - just so I can iterate over it. – Nikita Jul 08 '11 at 13:47
-1 Step 2 is obviously a typo and using **any** hash function throws uniqueness out the window -- it's hashing not compressing. – Blindy Jul 08 '11 at 13:49
@Nikita You don't need to create a byte array (it will take lots of memory; most hash functions are specified as handling a stream of short buffers anyway). Updated the answer with `DigestInputStream`. – phihag Jul 08 '11 at 13:54
@Blindy Yes, that was poorly expressed. I meant a relative low collision rate (and with SHA2 or better, astronomically small probabilities of any collision *ever* (barring advances in cryptoanalysis)). Is the updated version still incorrect? – phihag Jul 08 '11 at 13:56
The main difference between *normal* and *cryptographic* hash functions (assuming same hash output size), is that the cryptographic one protects against *intentional* collisions, while normal hashes only protect against accidental collisions. And normal hashes are usually faster. – Paŭlo Ebermann Jul 08 '11 at 14:23
@Paŭlo Ebermann Thanks, added a hint to the difference in the answer. – phihag Jul 08 '11 at 14:36

Corbin March · Answer 5 · 2011-07-08T15:01:25.703

Take a look at Bob Jenkins' article on non-cryptographic hashing. He walks through a number of approaches and discusses their strengths, weaknesses, and tradeoffs between speed and the probability of collisions.

If nothing else, it will allow you to justify your algorithm decision. Explain to your instructor why you chose speed over correctness or vice versa.

As a starting point, try his One-at-a-time hash:

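#include <stdint.h>

/* Assumption: ub4 is an unsigned 4-byte integer. */
typedef uint32_t ub4;

/* Assumption: mask is not defined in this excerpt; it presumably selects the
   low bits for the table size. 0xffffffff keeps the full 32-bit hash. */
static const ub4 mask = 0xffffffff;

ub4 one_at_a_time(char *key, ub4 len)
{
  ub4   hash, i;
  /* mix each byte of the key into the hash */
  for (hash=0, i=0; i<len; ++i)
  {
    hash += key[i];
    hash += (hash << 10);
    hash ^= (hash >> 6);
  }
  /* final avalanche */
  hash += (hash << 3);
  hash ^= (hash >> 11);
  hash += (hash << 15);
  return (hash & mask);
}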

It's simple, but does surprisingly well against more complex algorithms.
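
Since the question is about Java, here is a hedged port of the function above. The class name `OneAtATime` is made up; the port treats each byte as unsigned, uses `>>>` because Java ints are signed, and returns the full 32-bit value instead of applying a table mask.

```java
import java.nio.charset.StandardCharsets;

final class OneAtATime {
    static int hash(byte[] key) {
        int hash = 0;
        for (byte b : key) {
            hash += b & 0xFF;    // treat the byte as unsigned
            hash += hash << 10;
            hash ^= hash >>> 6;  // unsigned shift, matching the C code on ub4
        }
        hash += hash << 3;
        hash ^= hash >>> 11;
        hash += hash << 15;
        return hash;             // mask this yourself if you need a table index
    }

    public static void main(String[] args) {
        byte[] key = "a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(hash(key))); // prints ca2e9442
    }
}
```

The bytes fed into `hash` could come from the serialized form of the object, as in the question.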
