I believe there are three things you need to know about hash functions:
1. You want to boil an N-byte string down to a single int.
2. You want to further boil that int down to an index in the range 0 to one less than the number of "buckets" in your hash table. The tool of choice for this is of course the modulo operator, %.
3. Doing this really well is surprisingly hard, but if you're just getting started, even a crappy hash function will do.
There are lots of ways of doing #1. You can just add up the byte values of the characters in your string:
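    unsigned int hash1(const char *str)
    {
        unsigned int hash = 0;
        const unsigned char *p;

        /* Read the bytes as unsigned char so values above 127 don't go negative. */
        for (p = (const unsigned char *)str; *p != '\0'; p++)
            hash += *p;
        return hash;
    }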
Or you can exclusive-OR together the byte values of the characters in your string:

    unsigned int hash2(const char *str)
    {
        unsigned int hash = 0;
        const unsigned char *p;

        /* The cast is needed because str is const char *, not unsigned char *. */
        for (p = (const unsigned char *)str; *p != '\0'; p++)
            hash ^= *p;
        return hash;
    }
(Jumping ahead to point 3, both of these end up being really horrible, but they'll do for the moment.)
Up in the caller, you typically take the return value of one of these guys, and use % to turn it into an index into your hash table:

    #define HASHSIZE 37
    HashtabEnt hashtab[HASHSIZE];
    // ...
    unsigned ix = hash(string) % HASHSIZE;
    x = hashtab[ix];
    // ...
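To make the two steps concrete, here is a minimal sketch of a chained hash table built around that index calculation. The struct layout, the use of an array of list heads rather than an array of entries, and the names lookup and install are all assumptions made for illustration; they are not from the snippet above.

```c
#include <stdlib.h>
#include <string.h>

#define HASHSIZE 37

/* Hypothetical entry type: one node in a bucket's linked list. */
typedef struct HashtabEnt {
    struct HashtabEnt *next;
    char *key;
    int value;
} HashtabEnt;

/* One list head per bucket; static storage starts out all NULL. */
static HashtabEnt *hashtab[HASHSIZE];

/* Step 1: boil the string down to one unsigned int (the additive hash). */
static unsigned int hash(const char *str)
{
    unsigned int h = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)str; *p != '\0'; p++)
        h += *p;
    return h;
}

/* Return the entry for key, or NULL if it is not in the table. */
HashtabEnt *lookup(const char *key)
{
    HashtabEnt *e;

    /* Step 2: % HASHSIZE picks the bucket, then we walk its list. */
    for (e = hashtab[hash(key) % HASHSIZE]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}

/* Insert key or update its value; returns 0 on success, -1 if out of memory. */
int install(const char *key, int value)
{
    HashtabEnt *e = lookup(key);

    if (e == NULL) {
        unsigned ix = hash(key) % HASHSIZE;
        size_t len = strlen(key) + 1;

        e = malloc(sizeof *e);
        if (e == NULL)
            return -1;
        e->key = malloc(len);
        if (e->key == NULL) {
            free(e);
            return -1;
        }
        memcpy(e->key, key, len);
        e->next = hashtab[ix];      /* push onto the front of the bucket */
        hashtab[ix] = e;
    }
    e->value = value;
    return 0;
}
```

Because colliding keys just share a list, even a poor hash function still gives correct results here; it only makes lookups slower.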
And then the big question is, how do you write a good hash function? It's actually an area of considerable and ongoing theoretical interest, and I'm no expert, so I'm not going to try to give you a complete treatment. At the very least, you need to make sure that every byte of the input has some effect on the output. Ideally, the function should be able to generate values that cover the output range completely. Preferably, those values should be spread over the output range with a reasonably uniform distribution. If you needed a cryptographically secure hash you would have additional requirements, but for simple dictionary-style hashing you don't have to worry about those.
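One rough way to check the "reasonably uniform distribution" criterion is to hash a set of sample keys and count how many land in each bucket. The sketch below assumes a table of 37 buckets and repeats the additive hash so it stands alone; the helper name bucket_counts is hypothetical.

```c
#include <stddef.h>

#define HASHSIZE 37

/* The additive hash, repeated here so the sketch is self-contained. */
unsigned int hash1(const char *str)
{
    unsigned int hash = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)str; *p != '\0'; p++)
        hash += *p;
    return hash;
}

/* Fill counts[0..HASHSIZE-1] with the number of keys landing in each
   bucket, and return the size of the fullest bucket. */
size_t bucket_counts(unsigned int (*hashfn)(const char *),
                     const char *const keys[], size_t nkeys,
                     size_t counts[HASHSIZE])
{
    size_t i, max = 0;

    for (i = 0; i < HASHSIZE; i++)
        counts[i] = 0;
    for (i = 0; i < nkeys; i++) {
        size_t n = ++counts[hashfn(keys[i]) % HASHSIZE];
        if (n > max)
            max = n;
    }
    return max;
}
```

With a good hash function the fullest bucket should hold only a little more than nkeys / HASHSIZE keys. Feeding the additive hash a set of anagrams ("abc", "acb", "bac", and so on) shows the opposite extreme: the byte values always add up to the same total, so every key lands in the same bucket.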
My function hash2 up above is bad because it never generates a hash value greater than 255 (that is, of more than 8 bits), so it fails on "cover the output range completely". hash1 isn't much better, because unless the input string is long, it won't get much beyond 8 bits. A simple improvement is to combine a shift and an exclusive-OR:

    unsigned int hash3(const char *str)
    {
        unsigned int hash = 0;
        const unsigned char *p;

        /* Shift the running value left one bit, then XOR in the next byte. */
        for (p = (const unsigned char *)str; *p != '\0'; p++)
            hash = (hash << 1) ^ *p;
        return hash;
    }
But this is no good, either, because it always shifts bits off to the left. Once a byte's bits have been shifted past the top of the unsigned int, they are gone, so the final hash value ends up being a function of only the last few input bytes (the last 32, if unsigned int is 32 bits wide), not all of them. That is, it fails on "every byte of the input has some effect on the output".
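The lost-bytes problem is easy to see in a small experiment. This sketch pins the width to 32 bits with uint32_t so the behavior does not depend on the platform's unsigned int; the name hash3_32 is made up for the demonstration.

```c
#include <stdint.h>

/* The shift-and-XOR hash with the accumulator fixed at 32 bits
   (an assumption made for this demonstration). */
uint32_t hash3_32(const char *str)
{
    uint32_t hash = 0;
    const unsigned char *p;

    /* Each left shift pushes older bytes one bit closer to falling off
       the top of the 32-bit accumulator. */
    for (p = (const unsigned char *)str; *p != '\0'; p++)
        hash = (hash << 1) ^ *p;
    return hash;
}
```

Two strings that differ only in their first byte, followed by 32 or more identical bytes, hash to the same value, because the first byte's bits have been shifted out completely by the time the loop ends. Strings that differ near the end still hash differently.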
So another approach is to do a circular shift, and then an exclusive-OR of the next byte:
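    unsigned int hash4(const char *str)
    {
        unsigned int hash = 0;
        const unsigned char *p;

        /* Rotate the low 16 bits left by one, then XOR in the next byte. */
        for (p = (const unsigned char *)str; *p != '\0'; p++)
            hash = (((hash << 1) & 0xffff) | ((hash >> 15) & 1)) ^ *p;
        return hash;
    }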
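This is closely related to the traditional algorithm used by the BSD version of the Unix "sum" command, which also rotates a 16-bit value by one bit for each input byte (although it rotates right and adds the byte rather than exclusive-ORing it). Note that, like that checksum, hash4 only ever produces 16-bit values.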
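For comparison, here is a sketch of the rotate-right-and-add variant, in the style of the 16-bit BSD "sum" checksum as I understand it. The function name sum16 is hypothetical, and unlike the real command (which processes every byte of a file) this version stops at the string's terminating '\0'.

```c
/* 16-bit rotate-right-and-add checksum over a C string. */
unsigned int sum16(const char *str)
{
    unsigned int checksum = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)str; *p != '\0'; p++) {
        /* Rotate the 16-bit value right by one bit. */
        checksum = (checksum >> 1) | ((checksum & 1) << 15);
        /* Add the next byte and keep only the low 16 bits. */
        checksum += *p;
        checksum &= 0xffff;
    }
    return checksum;
}
```

As with hash4, the rotation makes the result depend on byte order, so "ab" and "ba" no longer collide the way they do with plain addition or exclusive-OR.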