0

I have an assignment (CS50 - speller) where I have to implement a hash table with linked lists. For an extra challenge we were also asked to implement a hashing algorithm. I am completely new to hash tables and hashing and know nothing about cryptography; after reading around for a while I found the djb2 hash, which I think will work well with my dataset (a dictionary of 143k lowercase words, some containing ') that I'll have to use to spell-check other texts.

My original thought after analysing the dataset was to split it by the first three letters of each word and have an array (one element per three-letter prefix that actually occurs in the dataset) where each element holds the head of a binary tree of linked lists of words. (I can't do that because the exercise already provides a struct for a singly linked list and a prototype for a hash function.)

This of course was before learning that hash tables are called that because they use a hash function. I'm completely in the dark about how to proceed.

I've seen that people frequently use the modulo operator % to map the hash to a bucket, but this confuses me: how can you guarantee that there won't be more collisions that way, and what would be the optimal array size to minimize them?

How would I map the results of the djb2 function to a hash table? Is there a better approach for my case?

MiguelP
    The only thing that's required for a valid hash function is that it produces a number for all inputs and `X==Y => h(X)==h(Y)`. Your code should behave correctly (albeit slowly) even if it's all collisions (e.g. `h(X)=0`). As a step two, a good hash function should minimize collisions. If you have a specific data set in mind, you can measure how many collisions happened with different hash functions and numbers of buckets. –  Jul 29 '21 at 13:51
    But yes, if you have N buckets and a hash function that returns arbitrary ints, `%N` is the usual method. "How can you guarantee...?" we cannot. No hash function is going to perform flawlessly on all possible data. Measure. –  Jul 29 '21 at 14:33

2 Answers


You use modulo. If you make the table size a power of 2, you can also calculate the same thing with a bitwise AND.
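For example (a rough sketch; the hash value and bucket count here are made up):

#include <stdio.h>

int main(void)
{
    unsigned int nbuckets = 64;              /* a power of 2, so the AND trick applies */
    unsigned int h = 123456789u;             /* pretend this came from your hash function */

    printf("%u\n", h % nbuckets);            /* works for any bucket count */
    printf("%u\n", h & (nbuckets - 1));      /* same index, but only when nbuckets is a power of 2 */
    return 0;
}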

how can you guarantee that there wont be more collisions that way AND what would be the optimal array size to minimize them?

You can't. And nobody knows. Except it should be as big as possible, but not so big it uses up all your memory.

Hash tables are fundamentally probabilistic data structures. There is no way to ensure they are completely, 100%, perfect in every way. You can only get "perfect enough" which is usually something like 95% perfect. If 5% of your buckets have two items in them... big deal, who cares. 95% of the time you only have to check one item and 5% of the time you still only have to check two.

Every hash table can have collisions. If it's a good hash function, the items go in the buckets completely randomly - as close as anyone can tell. If you have 5 items and 10 buckets there's about a 50% chance that bucket 1 has an item in it (actually 41%). There's about a 7% chance it has 2 items in it. There's about a 0.8% chance it has 3 items in it.
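Those figures are just binomial probabilities, so you can check them yourself; a quick sketch (5 items and 10 buckets as in the paragraph above; link with -lm for pow):

#include <stdio.h>
#include <math.h>

/* probability that one particular bucket gets exactly k of the n items,
   when each item independently lands in one of b buckets */
static double prob_exactly(int n, int b, int k)
{
    double choose = 1.0;
    for (int i = 0; i < k; i++)
        choose = choose * (n - i) / (i + 1);
    return choose * pow(1.0 / b, k) * pow(1.0 - 1.0 / b, n - k);
}

int main(void)
{
    printf("at least 1 item: %.3f\n", 1.0 - prob_exactly(5, 10, 0));  /* ~0.41 */
    printf("exactly 2 items: %.3f\n", prob_exactly(5, 10, 2));        /* ~0.073 */
    printf("exactly 3 items: %.3f\n", prob_exactly(5, 10, 3));        /* ~0.008 */
    return 0;
}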

The way to deal with this is to make sure your hash table can have more than one item in the same bucket, but it doesn't have to be fast, because it doesn't happen very often. A linked list is one way. A better way (because of CPU caches) is to use the next bucket instead, which is called open addressing, but it's complicated.
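For the linked-list ("separate chaining") version, insertion is just "hash the word, then prepend it to that bucket's list". A rough sketch (the struct, the bucket count N, and the word length are stand-ins for whatever your assignment gives you):

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define N 65536                       /* bucket count -- a stand-in value */

typedef struct node
{
    char word[46];                    /* 45-character words plus '\0' (an assumption) */
    struct node *next;
} node;

node *table[N];                       /* every bucket starts out as an empty list (NULL) */

bool insert(const char *word, unsigned int (*hash)(const char *))
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
        return false;
    strncpy(n->word, word, sizeof(n->word) - 1);
    n->word[sizeof(n->word) - 1] = '\0';

    unsigned int i = hash(word) % N;  /* map the hash onto a bucket index */
    n->next = table[i];               /* prepend: collisions just make the chain longer */
    table[i] = n;
    return true;
}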

Those probabilities go up quickly if you start putting, say, 10 items into 10 buckets. To make sure the probabilities stay low, most hashtables will expand their size when they're about 50% to 75% "full" (when the number of items, divided by the number of buckets, gets above some number they choose between 0.5 and 0.75).
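For instance (a minimal sketch of the load-factor check; the 0.75 threshold is just the upper end of the range mentioned above):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* "full enough" that most hash tables would double the bucket count
   and re-hash every item into the bigger array */
static bool needs_resize(size_t items, size_t buckets)
{
    return (double)items / (double)buckets > 0.75;
}

int main(void)
{
    printf("%d\n", needs_resize(5, 10));    /* 0: load factor 0.5, still fine */
    printf("%d\n", needs_resize(9, 10));    /* 1: load factor 0.9, time to grow */
    return 0;
}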

You can also have a high number of items in one bucket if you have a bad hash function, for example

int hash(const char *s) {return 0;}

will put every item into the same bucket no matter how your hash table tries to distribute them - whether it uses modulo, or something else. That's why a good hash function is essential.
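Since the question mentions djb2: the usual way to write it, and to map its result onto a table of N buckets, looks like this (N is just a placeholder; use whatever size you pick):

#include <stdio.h>

#define N 65536                               /* placeholder bucket count */

/* djb2 (Dan Bernstein): start at 5381, then hash = hash * 33 + next byte */
unsigned long djb2(const char *str)
{
    unsigned long hash = 5381;
    int c;
    while ((c = (unsigned char)*str++) != '\0')
        hash = ((hash << 5) + hash) + c;      /* hash * 33 + c */
    return hash;
}

int main(void)
{
    const char *word = "apple";
    unsigned long h = djb2(word);
    printf("%s -> %lu -> bucket %lu\n", word, h, h % N);
    return 0;
}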

user253751

I believe there are three things you need to know about hash functions:

  1. You want to boil an N-byte string down to a one-int number.
  2. You want to further boil that one-int number down to the number of "buckets" in your hash table. The tool of choice for this is of course the modulo operator, %.
  3. Doing this really well is surprisingly hard, but if you're just getting started, even a crappy hash function will do.

There are lots of ways of doing #1. You can just add up the byte values of the characters in your string:

unsigned int hash1(const char *str)
{
    unsigned int hash = 0;
    const unsigned char *p;                  /* unsigned char so high-bit bytes never sign-extend */
    for(p = (const unsigned char *)str; *p != '\0'; p++)
        hash += *p;
    return hash;
}

Or you can exclusive-OR together the byte values of the characters in your string:

unsigned int hash2(const char *str)
{
    unsigned int hash = 0;
    const unsigned char *p;
    for(p = (const unsigned char *)str; *p != '\0'; p++)
        hash ^= *p;
    return hash;
}

(Jumping ahead to point 3, both of these end up being really horrible, but they'll do for the moment.)

Up in the caller, you typically take the return value of one of these guys, and use % to turn it into an index into your hash table:

#define HASHSIZE 37
HashtabEnt hashtab[HASHSIZE];

// ...

unsigned ix = hash(string) % HASHSIZE;
x = hashtab[ix];

// ...

And then the big question is, how do you write a good hash function? It's actually an area of considerable and ongoing theoretical interest, and I'm no expert, so I'm not going to try to give you a complete treatment. At the very least you need to make sure that every byte of the input has some effect on the output. Ideally you want to be able to generate values that cover the output range completely. Preferably it will generate output values that cover the output range with a reasonably uniform distribution. If you needed a cryptographically secure hash you would have additional requirements, but for simple dictionary-style hashing you don't have to worry about those.

My function hash2 up above is bad because it never generates a hash value greater than 255 (that is, of more than 8 bits, so it probably fails on "cover the output range completely"). hash1 isn't much better, because unless the input string is large, it won't get much beyond 8 bits. A simple improvement is to combine a shift and an exclusive-OR:

unsigned int hash3(const char *str)
{
    unsigned int hash = 0;
    const unsigned char *p;
    for(p = (const unsigned char *)str; *p != '\0'; p++)
        hash = (hash << 1) ^ *p;
    return hash;
}

But this is no good, either, because it always shifts bits off to the left, meaning that the final hash value ends up being a function of only the last few input bytes, not all of them -- that is, it fails on "every byte of the input has some effect on the output".

So another approach is to do a circular shift, and then an exclusive-OR of the next byte:

unsigned int hash4(const char *str)
{
    unsigned int hash = 0;
    const unsigned char *p;
    for(p = (const unsigned char *)str; *p != '\0'; p++)
        hash = (((hash << 1) & 0xffff) | ((hash >> 15) & 1)) ^ *p;  /* 16-bit circular shift, then XOR */
    return hash;
}

This is the traditional algorithm used by the Unix "sum" command.
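And, as one of the comments on the question says, the honest answer to "which hash function and how many buckets?" is: measure. A small harness along these lines (the dictionary path and bucket count are placeholders, and hash4 can be swapped for any of the functions above or for djb2) counts how your 143k words actually spread out:

#include <stdio.h>

#define NBUCKETS 65536                 /* placeholder; try the sizes you're considering */

unsigned int hash4(const char *str);   /* defined above; swap in hash1/hash2/hash3/djb2 to compare */

int main(void)
{
    static unsigned long counts[NBUCKETS];
    char word[64];
    unsigned long words = 0, used = 0, longest = 0;

    /* one word per line, e.g. the CS50 dictionary file (the path is a placeholder) */
    FILE *fp = fopen("dictionaries/large", "r");
    if (fp == NULL)
        return 1;

    while (fscanf(fp, "%63s", word) == 1)
    {
        counts[hash4(word) % NBUCKETS]++;
        words++;
    }
    fclose(fp);

    for (unsigned long i = 0; i < NBUCKETS; i++)
    {
        if (counts[i] > 0)
            used++;
        if (counts[i] > longest)
            longest = counts[i];
    }
    printf("%lu words, %lu of %d buckets used, longest chain %lu\n",
           words, used, NBUCKETS, longest);
    return 0;
}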

Steve Summit