Can anyone please explain how this hash function works? I have spent a lot of time trying to figure it out and still don't understand it.

The full code is at https://gist.github.com/choaimeloo/ffb96f7e43d67e81f0d44c08837f5944#file-dictionary-c-L30

// Hashes the word (hash function posted on reddit by delipity)
// The word you want to hash is contained within new node, arrow, word.
// Hashing that will give you the index. Then you insert word into linked list.

int hash_index(char *hash_this)
{
    unsigned int hash = 0;
    for (int i = 0, n = strlen(hash_this); i < n; i++)
    {
        hash = (hash << 2) ^ hash_this[i];
    }
    return hash % HASHTABLE_SIZE;
}

I don't understand why he uses << and ^.

Also, why did he use strlen(hash_this)?

AlhasanY
  • What do you need from a hash function `h`? If `x == y`, then `h(x) == h(y)`; any function that satisfies this is a valid hash function. Ideally it should also assign different hash values to different inputs. (Hashing everything to zero is valid, but will perform like crap, because every input will collide.) ... So the function is frobbing the input value around in a deterministic fashion. It's obviously valid, and bit operations are cheap, so why not? Whether it produces collisions depends on the set of inputs it gets. –  Nov 19 '20 at 10:33
  • Yeah, this makes it clearer. Now I am interested in hashing everything to zero to test the timing of that code :) – AlhasanY Nov 22 '20 at 06:09

2 Answers

The purpose of a hash function is to derive a fixed-size value from a given sequence (of bytes, characters, ...), ideally a different value for each different sequence.

To visit every element you need the length of the sequence, obtained here with 'strlen'.

Without the bit shift operator (<<) you would get the same result for the sequences 'abc' and 'cba', because XOR alone does not depend on the order of its operands.

The xor operator (^) 'scrambles' / 'hashes' the current value further, so it becomes more unlikely that similar sequences result in the same value (imagine sequences with a certain pattern, like 'abcabc...').
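To make the 'abc' / 'cba' point concrete, here is a minimal sketch that compares two helpers. Both names (xor_only and shift_xor) are made up for this illustration and are not in the gist. shift_xor uses the same mixing step as hash_index but leaves out the final modulo, and both cast each character to unsigned char, which is an assumption added here to keep the arithmetic independent of whether char is signed.

```c
#include <string.h>

/* Hypothetical variant for comparison: XOR of all characters, no shift.
   The order of the characters cannot affect the result. */
unsigned int xor_only(const char *s)
{
    unsigned int hash = 0;
    for (size_t i = 0, n = strlen(s); i < n; i++)
    {
        hash ^= (unsigned char)s[i];
    }
    return hash;
}

/* Same mixing step as hash_index in the question, without the modulo.
   Each earlier character ends up shifted further left than later ones. */
unsigned int shift_xor(const char *s)
{
    unsigned int hash = 0;
    for (size_t i = 0, n = strlen(s); i < n; i++)
    {
        hash = (hash << 2) ^ (unsigned char)s[i];
    }
    return hash;
}
```

With ASCII input, xor_only gives 0x60 for both "abc" and "cba", while shift_xor gives 0x7FB for "abc" and 0x7D9 for "cba".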

Erdal Küçük

He’s using strlen because he’s iterating through the string and processing each character. He could also test that hash_this[i] is not zero:

for ( int i = 0; hash_this[i] != 0; i++ )
  ...

which would do the same thing.
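As a rough sketch of that equivalence, the two loop styles can be written side by side. The function names below are hypothetical, and the HASHTABLE_SIZE value is only a placeholder assumed for this example; the real value is defined in the linked gist.

```c
#include <string.h>

/* Placeholder value, assumed for this sketch only. */
#ifndef HASHTABLE_SIZE
#define HASHTABLE_SIZE 65536
#endif

/* Loop bounded by strlen, as in the question. */
int hash_index_strlen(const char *hash_this)
{
    unsigned int hash = 0;
    for (int i = 0, n = strlen(hash_this); i < n; i++)
    {
        hash = (hash << 2) ^ hash_this[i];
    }
    return hash % HASHTABLE_SIZE;
}

/* Loop that stops at the terminating zero byte instead. */
int hash_index_nul(const char *hash_this)
{
    unsigned int hash = 0;
    for (int i = 0; hash_this[i] != 0; i++)
    {
        hash = (hash << 2) ^ hash_this[i];
    }
    return hash % HASHTABLE_SIZE;
}
```

Both versions visit exactly the same characters, so they return the same index for any given string. The second one avoids a separate pass over the string just to measure it.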

The bitwise operators make the result depend on the order of the characters, so different arrangements of the same letters generally produce different indices. You want hash_index( "bat" ) to return a different value than hash_index( "tab" ).

Returning the same index for different strings is known as a collision and it’s something you want to avoid, so most good hash functions do some kind of arithmetic or bitwise operation on each character to minimize the possibility.
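Collisions are reduced, not eliminated. The sketch below uses a hypothetical helper, raw_hash, that repeats the mixing step from the question without the final modulo (the cast to unsigned char is an addition for this example). Because the shift is only 2 bits, neighbouring characters overlap, and some short strings still end up with the same value.

```c
/* Hypothetical helper: the question's mixing step, without the modulo. */
unsigned int raw_hash(const char *s)
{
    unsigned int hash = 0;
    for (int i = 0; s[i] != 0; i++)
    {
        /* Shift the running value left by 2 bits, then XOR in the next character. */
        hash = (hash << 2) ^ (unsigned char)s[i];
    }
    return hash;
}
```

With ASCII input, raw_hash("bat") is 0x7D0 and raw_hash("tab") is 0x6A6, so the anagrams are told apart. However, raw_hash("ad") and raw_hash("bh") are both 0x1E0, which is why the hash table still needs a linked list per bucket.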

John Bode