
As the title states: how do I calculate the optimal value of m to use, and how do I motivate that choice?

Say we are going to build a hash table which uses the following hash function:

h(k) = k mod m, where k is the key

Various sources tell me:

  1. to use the number of elements to be inserted as the value of m
  2. to use a prime close to m
  3. that Java simply uses 31 as its value of m
  4. to use the prime closest to 2^n as m

I'm so confused at this point that I don't know which value to use for m. For instance, if we use the table size for m, what happens when we want to expand the table? Will I then have to rehash all the values with the new value of m? And if so, why does Java simply use 31 as its prime value for m?

I've also heard that the table size should be twice as large as the total number of elements in the hash table, and that this should hold after every rehash. But then why would we use m = 10 for a table of 10 elements, when it should be m = 20 to create that extra empty space?

Can someone please help me understand how to choose the value of m in different scenarios, such as a static hash table (where we know we will only insert, say, 10 elements) versus a dynamic one (which rehashes after a certain load limit)?

Let me illustrate my problem with the following examples:

I got the values {1,2,...,n}

Question: What would be an optimal value of m if I must use the division method in my hash function?

Scenario 1: n = 100?

Scenario 2: n = 5043?

Additional question: Would the value of m be different if we used open or closed hashing?

Note that I'm not asking about hash tables in Java specifically, but about hash tables in general, where I must use a division-mod hash function.

Thank you for your time!

Alexander
  • Java uses whatever you tell it to by overriding `hashCode`. The typical idiom involves iterated *multiplication* by 31 when combining the hashcodes of individual object attributes, and doesn't use modulus at all. – Marko Topolnik Aug 25 '13 at 10:50
  • Yes, that answers that question but how about if we make it more generalised in the context of hashfunction with division of mod. Would 31 be an ideal prime number to use for an n < 31, if so what if the size of n is larger the 31 what will it then be? – Alexander Aug 25 '13 at 10:59
  • The hash function itself has nothing to do with the size of the hashtable. Its main desirable characteristic is that it disperses the values well. Note that in Java, the hash function belongs to the key object, whereas the hashtable size is obviously encapsulated in a totally different, independent class. You can't optimize hashCode for a specific hashtable size. – Marko Topolnik Aug 25 '13 at 11:01
  • Unless your keys are integers, your `h(k)` is not a hash function (or any function at all - it's simply ill-formed). Instead it looks like the code which turns a hash value into a table index. Then m is the table size. –  Aug 25 '13 at 11:45
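For completeness, the multiply-by-31 idiom Marko mentions typically looks like this in a `hashCode` override (the `Point` class and its fields are purely illustrative, not from the question):

```java
public class Point {
    private final int x, y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public int hashCode() {
        // Combine the fields' hash codes by iterated multiplication by 31;
        // no modulus is involved here -- the hash table applies that later.
        int result = 17;             // arbitrary non-zero seed
        result = 31 * result + x;
        result = 31 * result + y;
        return result;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }
}
```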

1 Answer


You have several issues here: 1) What should m equal? 2) How much free space should you have in your hash table? 3) Should you make the size of your table be a prime number?

1) As was mentioned in the comments, the h(k) you describe isn't the hash function; it gives you the index into your hash table. The idea is that every object produces some hash code, which is an integer (possibly negative, depending on the language). You use the hash code to figure out where to put the object in the hash table (so that you can find it again later). You clearly don't want a hash table of size MAX_INT, so you choose some smaller size m. Then for any object, you take its hash code h, compute h mod m, and now you have an integer in the interval [0, m-1], which is a valid index into your hash table.
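To make that indexing step concrete, here is a small Java sketch (the class and method names are my own, purely illustrative). `Math.floorMod` is used because `hashCode()` may return a negative number, and a plain `%` would then yield a negative index:

```java
public class IndexDemo {
    // Map an object's hash code to a table slot in [0, m-1].
    static int indexFor(Object key, int m) {
        return Math.floorMod(key.hashCode(), m);
    }

    public static void main(String[] args) {
        int m = 13; // table size (a prime, per the recommendations below)
        // Integer's hash code is the value itself, so 42 maps to 42 % 13 == 3.
        System.out.println(indexFor(42, m));
        System.out.println(indexFor("apple", m));
    }
}
```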

2) Because a hash table works by using a hash code to find the place in a table where an object should go, you get into trouble if multiple items are assigned to the same location. This is called a collision. Every hash table implementation must deal with collisions, either by putting items into nearby spots (open addressing) or by keeping a linked list of items in each location (separate chaining). Whatever the strategy, more collisions mean lower performance for your hash table. For that reason, you shouldn't let your hash table fill up; keeping it at least twice as large as the number of items is a common recommendation for reducing the probability of collisions. Obviously, this means you will have to resize your table as it fills up. And yes, that means rehashing every item, since each one will generally land in a different location once you take the modulus by a different value. That is the hidden cost of a hash table: it runs in constant time (assuming few or no collisions), but that constant can be large (amortized resizing, rehashing, etc.).
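Here is a minimal sketch of why resizing forces a rehash: the same hash value maps to a different slot once m changes. The sizes 11 and 23 are arbitrary examples:

```java
public class RehashDemo {
    static int indexFor(int hash, int m) {
        return Math.floorMod(hash, m);
    }

    public static void main(String[] args) {
        int hash = 5043;          // some item's hash code
        int oldM = 11, newM = 23; // table size before and after growing

        int before = indexFor(hash, oldM); // 5043 % 11 == 5
        int after  = indexFor(hash, newM); // 5043 % 23 == 6
        // The slot changes when the table grows, so every stored
        // entry must be re-inserted under the new modulus.
        System.out.println(before + " -> " + after);
    }
}
```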

3) It is also often recommended that you make the size of your hash table be a prime number. This is because it tends to produce a better distribution of items in your hash table in certain common use cases, thus avoiding collisions. Rather than giving a complete explanation here, I will refer you to this excellent answer: Why should hash functions use a prime number modulus?
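A quick experiment illustrates the point: if the keys share a common factor with m, many slots are never used, while a prime m spreads the same keys over every slot. The keys and table sizes below are made up for illustration:

```java
import java.util.Arrays;

public class PrimeModDemo {
    // Count how many keys land in each slot for a given table size m.
    static int[] histogram(int[] keys, int m) {
        int[] counts = new int[m];
        for (int k : keys) counts[Math.floorMod(k, m)]++;
        return counts;
    }

    public static void main(String[] args) {
        // Keys sharing a common factor of 4 (think aligned addresses).
        int[] keys = new int[25];
        for (int i = 0; i < keys.length; i++) keys[i] = 4 * i;

        // m = 8 shares the factor 4: only slots 0 and 4 are ever used.
        System.out.println(Arrays.toString(histogram(keys, 8)));
        // m = 7 is prime: the same keys spread over all seven slots.
        System.out.println(Arrays.toString(histogram(keys, 7)));
    }
}
```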

Jeremy West