4

I'm looking for a hash function that maps strings to ints, with the following restrictions.

Restrictions: The same string must always map to the same number, and different strings must map to different numbers. During a single run of the application, all the strings I receive have the same length, but that length is known only at runtime.

Any suggestions on how to create such a hash function?

Night Walker

4 Answers

4

A hash function never guarantees that two different values (strings in your case) yield different hash codes. However, equal values always yield the same hash code.

This is because information gets lost. If you have a string with a length of 32 characters, it occupies 64 bytes (2 bytes per char), while an int hash code has only four bytes. Squeezing 64 bytes into 4 inevitably maps some different strings to the same hash code; this is called a collision.

Note: Dictionary<TKey,TValue> uses a hash table internally. Therefore it implements a collision resolution strategy. See An Extensive Examination of Data Structures Using C# 2.0 on MSDN.
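To put rough numbers on the information loss, here is the count for the 32-character example, treating every sequence of 32 16-bit chars as a possible string (a counting sketch, not part of the original answer):

$$
\underbrace{\left(2^{16}\right)^{32}}_{\text{32-char strings}} = 2^{512}
\quad\gg\quad
\underbrace{2^{32}}_{\text{int values}},
\qquad
\frac{2^{512}}{2^{32}} = 2^{480}
$$

So on average about $2^{480}$ different 32-character strings share each int value, whatever hash function is used.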

Here is the current implementation of dictionary.cs.
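Since the dictionary resolves collisions internally, one way to get the mapping the question asks for within a single run is to let a `Dictionary<string, int>` hand out sequential IDs. This is only a sketch: `ids` and `GetId` are hypothetical names introduced here, the numbers are not a hash of the string's content, and they are stable only for the lifetime of the process.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original answer: the dictionary
// remembers every string seen so far together with the ID it was given.
var ids = new Dictionary<string, int>(StringComparer.Ordinal);

int GetId(string s)
{
    // First time we see s: assign the next unused number.
    if (!ids.TryGetValue(s, out int id))
    {
        id = ids.Count;
        ids.Add(s, id);
    }
    return id;
}

Console.WriteLine(GetId("abcd")); // 0
Console.WriteLine(GetId("abce")); // 1
Console.WriteLine(GetId("abcd")); // 0 again: same string, same number
```

The trade-off is memory: every distinct string has to be kept in the dictionary, which is the price of guaranteeing that different strings get different numbers.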

Olivier Jacot-Descombes
3

You aren't going to find a hash algorithm that guarantees that the same integer won't be returned for different strings. By definition, hash algorithms have collisions. There are far more possible strings in the world than there are possible 32-bit integers.

Robert Levy
3

Different strings go to different numbers.

There are more strings than there are numbers, so this is flat out impossible without restricting the input set. You can't put n pigeons in m boxes with n > m without having at least one box contain more than one pigeon.
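To illustrate what "restricting the input set" can buy, here is a minimal sketch under assumptions that are not in the question: the strings contain only the lowercase letters 'a' to 'z' and are at most 6 characters long. Then there are at most 26^6 = 308,915,776 strings of a given length, which fits in an int, and reading the string as a base-26 number gives distinct values for distinct strings of the same length. `Encode` is a hypothetical name.

```csharp
using System;

// Assumption for illustration only: alphabet 'a'..'z', length <= 6.
// Strings of the SAME length get distinct values; strings of different
// lengths may coincide (e.g. "a" and "aa" both give 0).
int Encode(string s)
{
    if (s.Length > 6)
        throw new ArgumentException("Too long for a collision-free int.", nameof(s));

    int result = 0;
    foreach (char c in s)
    {
        if (c < 'a' || c > 'z')
            throw new ArgumentException("Only 'a'..'z' is supported.", nameof(s));
        result = result * 26 + (c - 'a'); // treat the string as a base-26 number
    }
    return result;
}

Console.WriteLine(Encode("aaab")); // 1
Console.WriteLine(Encode("aaba")); // 26
```

As soon as the alphabet or the length grows beyond what fits in 32 bits, the pigeonhole argument applies again and no such encoding exists.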

jason
1

Is the String.GetHashCode function not right for your needs?

Jesse Smith
  • 3
    It doesn't satisfy the impossible requirement that different strings go to different numbers. – jason Jan 25 '12 at 15:35
  • @Jason: True, but the high probability ensured by GetHashCode might suffice for the OP's requirements. – Heinzi Jan 25 '12 at 15:37
  • @Heinzi - can he call you to debug when his app suddenly stops working a few years from now? :) – Robert Levy Jan 25 '12 at 15:40
  • 2
    @Heinzi: `GetHashCode` does not ensure anything like that! In fact, it's highly likely that you do get collisions. It's the birthday problem, just with more days in the calendar. With `2^32` buckets, it only takes around 75000 strings to have a greater than 0.5 chance of a collision. – jason Jan 25 '12 at 15:43
  • @Jason: Good point about the birthday paradox. As I said, it depends on the OP's requirements. If he wants to implement a hash table for random string values, GetHashCode is fine. If he wants to check for equality, no hash function is sufficient. – Heinzi Jan 25 '12 at 15:57
  • 1
    The "high probability" of uniqueness is an illusion. I've found that when hashing strings the likelihood of getting a collision is 50% after about 70,000 strings are generated. I've *never* found a case when I could insert 200,000 strings without generating a duplicate. This holds true for real-world data as well as for randomly generated test data in multiple applications I've written over the past 15 years. – Jim Mischel Jul 20 '16 at 02:43
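
The figures quoted in the comments above are in line with the standard birthday-problem approximation. As a rough check, assuming an idealized hash that spreads $n$ strings uniformly over $N = 2^{32}$ values (real hash functions only approximate this):

$$
P(\text{collision}) \approx 1 - e^{-n^2 / (2N)}
$$

Setting $P = 0.5$ gives

$$
n \approx \sqrt{2N \ln 2} = \sqrt{2 \ln 2}\cdot 2^{16} \approx 1.1774 \times 65536 \approx 77{,}000,
$$

and for $n = 200{,}000$:

$$
P \approx 1 - e^{-(2\times 10^{5})^2 / 2^{33}} \approx 1 - e^{-4.66} \approx 0.99 .
$$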