
I'm wondering whether it would be a good idea to use hashes (CityHash, Murmur and the like) as keys in a key-value store like Hazelcast. I'm expecting to have about 2,000,000,000 records (URLs) in the database, so collisions could happen. It wouldn't be super critical to lose some data through hash collisions, but of course it would be best to avoid them.

A record contains the URL, a timestamp, and a status code. The main operations are inserting records and checking whether a URL already exists.

So, given that speed matters, which would you suggest:

  • using an ID generator, or
  • using a hash algorithm like CityHash or Murmur, or
  • using the relevant string itself, a URL in this case?
deamon
  • What's the rest of the data you need to store? What type of operations do you need to run? Just insert and check for duplication? Or are you counting visits or reporting on URL histories? Many key-value stores I've seen will handle string keys with hashing behind the scenes, including handling of hash collisions between distinct strings transparently. So adding your own hash code in front may degrade the performance. – Patrick M Jun 10 '15 at 18:06
  • Thanks for your comment. I've added some details to my question. – deamon Jun 10 '15 at 18:52

1 Answer


Hazelcast does not rely on the hashCode/equals methods of the key object; instead, it uses the Murmur hash of the key's binary (serialized) representation.
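The general idea can be illustrated with a small stand-alone sketch. This is not Hazelcast code, only an analogy built on `java.util.HashMap`: the hypothetical `BinaryKey` wrapper stands in for a key's serialized bytes, and its hash is deliberately weak so that collisions are guaranteed. It assumes, as is usual for hash-based stores, that the hash only narrows down where an entry lives, while the full bytes decide whether two keys are the same.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class BinaryKeyDemo {

    /** Hypothetical wrapper: a key is identified by its bytes, not by its hash. */
    static final class BinaryKey {
        private final byte[] bytes;

        BinaryKey(String url) {
            this.bytes = url.getBytes(StandardCharsets.UTF_8);
        }

        // Deliberately weak hash (only four possible values) to force collisions.
        @Override
        public int hashCode() {
            return bytes.length % 4;
        }

        // Two keys are equal only if every byte matches.
        @Override
        public boolean equals(Object other) {
            return other instanceof BinaryKey
                    && Arrays.equals(bytes, ((BinaryKey) other).bytes);
        }
    }

    /** Inserts each URL unless it is already present; returns the number of stored entries. */
    static int countDistinct(String... urls) {
        Map<BinaryKey, String> store = new HashMap<>();
        for (String url : urls) {
            store.putIfAbsent(new BinaryKey(url), url);
        }
        return store.size();
    }

    public static void main(String[] args) {
        // The first two URLs have the same length, so their weak hashes collide,
        // yet they are stored as separate entries; the third is a true duplicate.
        System.out.println(countDistinct(
                "http://a.example/1",
                "http://a.example/2",
                "http://a.example/1"));
    }
}
```

The two colliding URLs remain separate entries, and only the exact duplicate is rejected. A collision costs a little extra comparison work but does not lose data.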

In short, you should not really worry about hash collisions.
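For comparison, if a hash value were used as the key itself (the second option in the question), distinct URLs with the same hash would overwrite each other. A rough estimate of how often that would happen for 2,000,000,000 URLs can be sketched with the birthday approximation. The class and method names below are made up for illustration, and the estimate assumes the hash behaves like a uniformly random function.

```java
public class CollisionEstimate {

    /** Expected number of colliding pairs among n keys: (n choose 2) / 2^bits. */
    static double expectedCollidingPairs(long n, int bits) {
        double pairs = (double) n * (n - 1) / 2.0;
        return pairs / Math.pow(2.0, bits);
    }

    /** Birthday approximation: P(at least one collision) is about 1 - exp(-expected pairs). */
    static double probabilityOfAnyCollision(long n, int bits) {
        return -Math.expm1(-expectedCollidingPairs(n, bits));
    }

    public static void main(String[] args) {
        long n = 2_000_000_000L; // record count from the question
        for (int bits : new int[] {32, 64, 128}) {
            System.out.printf("%d-bit hash: expected colliding pairs = %.3g, P(any collision) = %.3g%n",
                    bits, expectedCollidingPairs(n, bits), probabilityOfAnyCollision(n, bits));
        }
    }
}
```

Under these assumptions, a 32-bit hash would produce hundreds of millions of colliding pairs, a 64-bit hash would have roughly a 10% chance of at least one collision, and a 128-bit hash would make a collision negligibly unlikely. Using the URL itself as the key avoids the question entirely.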

  • Some example with explanation would be great. – Nilambar Sharma Jun 12 '15 at 10:55
  • @Nilambar I don't think I can give any meaningful examples here, since the hashing happens behind the scenes. The relevant code can be found in the following method: com.hazelcast.map.impl.proxy.MapProxyImpl#put(K, V, long, java.util.concurrent.TimeUnit) – Dmytro Kulaiev Jun 12 '15 at 10:58