
I'm writing a double-hashing hash table which only takes integers.

unsigned int DoubleHashTable::HashFunction1(unsigned int const data)
{
   return (data % GetTableSize());
}

unsigned int DoubleHashTable::HashFunction2(unsigned int const data, unsigned int count)
{
   return ((HashFunction1(data) + count * (5 - (data % 5)) % GetTableSize()));
}

and I'm trying to insert data into the table with SetData()

void DoubleHashTable::SetData(unsigned int const data)
{
   unsigned int probe = HashFunction1(data);

   if (m_table[probe].GetStatus())
   {
      unsigned int count = 1;
      while (m_table[probe].GetStatus() && count <= GetTableSize())
      {
         probe = HashFunction2(data, count);
         count++;
      }
   }

   m_table[probe].Insert(data);
}

After putting 100 integer items into a table of size 100, the table shows me that some indexes are left blank. I know insertion can take O(N), which is the worst case. My question is: every item should end up inserted into the table with no empty space left, even if it takes worst-case search time, right? I can't find the problem in my functions.

An additional question: there are well-known hash algorithms, and the purpose of double hashing is to produce as few collisions as possible, with H2(T) acting as a backup for H1(T). But if a well-known hashing algorithm (like MD5, SHA, and others; I'm not talking about security, just well-known algorithms) is fast and well-distributed, why do we need double hashing?

Thanks!

user58569
  • Double hashing is sometimes useful because there is no such thing as a perfect hash function. The best we can do is minimize collisions. Also, what integers are you hashing? – Drakes Apr 14 '15 at 23:38
  • I'm just using the rand() function to generate random numbers as unsigned int. My table size is 101 (a prime number) and I put 101 items into the table. – user58569 Apr 14 '15 at 23:48
  • I suggest inserting a large number of random numbers, say 1000 * tableSize, and seeing how filled each slot is. The distribution should be about even. – Drakes Apr 15 '15 at 00:38
  • I tried inserting random numbers that are always higher than 10000, rand() % 10000 + 10000. The result has 5 empty spaces left (the size of the table is 101). Actually, I can't follow your comment; what difference does the large number make? In my head, if there is no problem, the table always has to be completely filled. – user58569 Apr 15 '15 at 02:53
  • I found the problem: HashFunction2() returns a wrong index that is higher than the size of the table... (when the size of the table is 11, it returns 13). I understand the concept of double hashing and how it differs from linear probing. Is there a rule for H2(T)? The H2(T) I'm using now is from the internet; I just read some slides from a university CS class... – user58569 Apr 15 '15 at 03:04
  • I'll add an answer to explain my comment if you like – Drakes Apr 15 '15 at 03:15
  • I'm fine with either an answer in the comments or a reply to my post. Happy to wait. Thanks! – user58569 Apr 15 '15 at 03:49
  • @Drakes: your comment implies you're expecting open hashing (e.g. linked lists of elements colliding at a bucket), but this question is about closed hashing (aka open addressing) - moving through a repeatable sequence of alternative buckets until an unused one is found. With closed hashing, you can't insert `1000*tableSize` elements - the `tableSize` is a hard upper limit, and it's usually a good idea to fill it no more than 80-90% as even with an excellent hash function and collision handling it will start to slow down dramatically (due to multiple collisions before finding empty buckets). – Tony Delroy Apr 15 '15 at 04:15
  • Yes, you are right. This test would be for hash-chaining. For open-addressing, the buckets should be filled after 101 unique entries. My answer is just to your second question since you fixed your code. :) – Drakes Apr 15 '15 at 04:24

1 Answer


When testing hash functions, there may be high collisions with certain pathological inputs (i.e., those which break your hash function). These inputs can be discovered by reversing the hash function, which can lead to certain attacks (this is a real concern, as internet routers have limited space for hash tables). Even with no adversary, the lookup time of such a hash table can grow after certain inputs and even become linear in the worst case.
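As a rough illustration (the table size and keys below are made up for this sketch, not taken from the question): with a plain modulo hash such as data % tableSize, any set of keys that are all congruent modulo the table size collapses into a single bucket, so every insert or lookup has to walk a long probe sequence.

#include <cstdio>

int main()
{
   const unsigned int tableSize = 101;               // assumed prime table size
   for (unsigned int i = 1; i <= 5; ++i)
   {
      unsigned int key = i * tableSize + 7;          // all keys congruent to 7 mod 101
      std::printf("key %u -> bucket %u\n", key, key % tableSize);   // always bucket 7
   }
   return 0;
}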

Double hashing is a method of resolving hash collisions that tries to avoid this linear growth on pathological inputs; linear probing is another popular open-addressing choice. However, the number of inputs must be much lower than the table size in these cases, unless your hash table can grow dynamically.
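To make that concrete, here is a minimal, self-contained sketch of a double-hashing probe (the class and member names are hypothetical, not the asker's DoubleHashTable). The usual rule for the second hash is that it must never return 0 and should be relatively prime to the table size; with a prime table size and the whole probe expression reduced modulo that size, the sequence visits every slot, so inserting exactly tableSize items does fill every bucket.

#include <vector>
#include <optional>

class DoubleHashSketch
{
public:
   explicit DoubleHashSketch(unsigned int size = 101)    // a prime size keeps the probe sequence full-cycle
      : m_size(size), m_table(size) {}

   bool Insert(unsigned int key)
   {
      for (unsigned int i = 0; i < m_size; ++i)
      {
         // Reduce the whole expression mod m_size so the index can
         // never run past the end of the table.
         unsigned int probe = (H1(key) + i * H2(key)) % m_size;
         if (!m_table[probe]) { m_table[probe] = key; return true; }
      }
      return false;   // every slot was visited: the table is completely full
   }

private:
   unsigned int H1(unsigned int key) const { return key % m_size; }
   unsigned int H2(unsigned int key) const { return 1 + (key % (m_size - 1)); }   // step size, never zero

   unsigned int m_size;
   std::vector<std::optional<unsigned int>> m_table;     // empty optional == free slot
};

Because H2 is nonzero and relatively prime to the prime table size, (H1 + i * H2) % m_size is a permutation of all the slots as i runs from 0 to m_size - 1, which is exactly the property the original HashFunction2 loses when only part of the expression is reduced modulo the table size.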

To answer your second question (now that you have fixed your code on your own), in a nutshell, double hashing is better-suited for small hash tables, and single hashing is better-suited for large hash tables.

Drakes
  • Thanks Drakes! Now I can understand the reasons for double hashing versus linear probing. I know my 2nd hash function returns an incorrect table index and leaves empty spaces when I put in as many items as the table size. How can I design the hash functions for double hashing so that it works better than linear probing? (I already did open addressing with linear probing and chaining; I just used (data % size of table) to find the index, but I need a 2nd hash function for double hashing.) I'm still looking for nice hash functions for double hashing but they are hard to find. – user58569 Apr 15 '15 at 19:11
  • What I understand from your answer is: if there is no perfect hash function for the situation where the number of inputs is (almost) the same as the size of the table, and the table isn't grown when the load factor approaches 70%, the hash table starts to fail? If what I understand is right, do all kinds of hash tables (except chaining, unless it needs it) always grow their size when usage of the table exceeds some standard percentage (like 70% ~ 75%)? – user58569 Apr 15 '15 at 19:19
  • Here are some examples of more probing and double hashing functions with animations. It's fun to step through. http://algoviz.org/OpenDSA/Books/OpenDSA/html/HashCImproved.html – Drakes Apr 15 '15 at 22:01
  • Also, even with a huge hash-chaining table, if the hash function is known then there is _always_ a way to compute collisions that will slow down the lookup. My favorite example is the router [hash table attack](https://www.eng.tau.ac.il/~yash/C2_039_Wool.pdf) (PDF) – Drakes Apr 15 '15 at 22:05