1

I am implementing hash function in order to check the anagrams, but I am not getting desired output. Could you suggest what went wrong?

Output:

key[148]:val[joy]
key[174]:val[jam]
key[294]:val[paula]
key[13]:val[ulrich]
key[174]:val[cat]
key[174]:val[act]
key[148]:val[yoj]
key[265]:val[vij]
key[265]:val[jiv]

Here key value 174 is fine for strings act and cat (anagrams) but same can't be expected with jam.

Below is the code snippet.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

unsigned long hash(char *str, size_t size) {
    unsigned long hash_val = 5381;
    unsigned long sum = 0;
    char *val;
    int i, j;
    for (j = 0; j < 9; j++) {
        val = malloc(strlen(str) + 1);
        memset(val, '\0', strlen(str) + 1);
        strcpy(val, str);

        for (i = 0; val[i] != '\0'; i++) {
            sum = sum + val[i];
        }
        return size % sum;
    }
}

int main() {
   int i;
   char *str[9] = { "joy", "jam", "paula", "ulrich","cat", "act","yoj", "vij", "jiv" };
   unsigned long key;
   size_t size = 4542; // it may be anything just for test it is being used
   for (i = 0; i < 9; i++) {
        key = hash(str[i], size);
        printf("\nkey[%ld]:val[%s]", key, str[i]);
    }
    return 1;
}
chqrlie
  • 131,814
  • 10
  • 121
  • 189
pri
  • 119
  • 1
  • 9
  • Why down votes for the same. I was expecting the resolution of this problem. I know what are the pitfalls of current approach so was just expecting proper solution of this issue not down votes. – pri Aug 05 '15 at 09:02

3 Answers3

3

Yes, it can, because your hash function is very poorly written - it returns your constant 'size' variable modulo sum of all the string characters.

The problem is that the sum of ASCII codes 'c' + 'a' + 't' is equal to the 'j' + 'a' + 'm' (equal to 312) so you are getting the same value for your 'hash'.

You could use a 'normal' (e.g. polynomial) hash function for your anagram table, but with sorted strings - that would be the easiest approach.

For another method, you can calculate a number of appearances of each letter in the string (a histogram) and hash (or just store as is) them instead.

I recommend you to do some research on this topic as it's a very common task.

Also, you could just sort the strings and let unordered_set<string> do the job for you.

dreamzor
  • 5,795
  • 4
  • 41
  • 61
  • do you have any algorithm for the same? – pri Aug 05 '15 at 08:59
  • @pri yes, as I said, you can use any hashing algorithm, e.g. http://stackoverflow.com/questions/7666509/hash-function-for-string but sort a string before you calculate the hash – dreamzor Aug 05 '15 at 09:12
  • humm. I was aware of this link but didn't think about the sorting , thanks for your help. +1 from my end as well to you. – pri Aug 05 '15 at 09:18
2

but same can't be expected with jam.

Well, there you go wrong. Let's see your algo. What you're doing is basically summing up the ASCII value of the elements of the strings, and returning the modulus result of a fixed value taken with respect to the sum.

To elaborate, as per the ASCII table,

j == 106
a == 97
m == 109

and

c == 99
a == 97
t == 116

Both the words end up having a sum result of 312. Now as per your algo,

 4542 % 312

is suppose to give a constant value, right? That is what it is giving.

Now, don't be "sad", as

s == 115
a ==97
d == 100

that also comes up with 312.

That said, I see you have a local variable unsigned long hash_val = 5381; defined inside your function, but used nowhere.

Sourav Ghosh
  • 133,132
  • 16
  • 183
  • 261
  • Thanks for your observation.currently there is no use of that hash_val in my implementation. I also had also performed the sum of ASCII values and found that sum will be same for cat and jam but is there anyway to improve this algorithm as i don't want to perform additional check for similar sum values in order to make them different. – pri Aug 05 '15 at 08:19
0

Your hash function has many problems:

  • The for (j = 0; j < 9; j++) loop is completely useless.
  • It is utterly inadequate to allocate memory for a copy of the string, and to forget to free it! Just use the string directly.
  • You summing method has too many easy collisions, as you diagnosed: anagrams produce the same sum, but also many simple words. You should shuffle the sum between before each character value is added.
  • return size % sum; should really be return sum % size; so the return value can be used as an index into the hash table of size size. As a matter of fact, size % sum would invoke undefined behavior if sum happened to compute to 0, which would require a very long string (>16MB) but is possible.

Here is an improved hash function:

#include <limits.h>

// constraints: str != NULL, size > 0
size_t hash(const char *str, size_t size) {
    size_t sum = 5381;  // initial salt
    while (*str != '\0') {
        // rotate the current sum 2 places to the left
        sum = (sum << 2) | (sum >> (CHAR_BIT * sizeof(sum) - 2));
        // add the next character value
        sum += (unsigned char)*str++;
    }
    return sum % size;
}
chqrlie
  • 131,814
  • 10
  • 121
  • 189