2

I'm working at cs50 speller. This is my hash function. It works well. But I think it's not really good because there's a lot of if condition. I try to make a different table for every word that have apostrophe at the first two letter (ex, A', B', C').

So I used buckets (26 x 26 x 26) + 26 = 17602;

unsigned int hash(const char *word)
{
    /*
        this is HASHING algorithms in which we return value with checking first three letter
        make sure every word set to lowercase so we can easily convert as integer value or alphabetical index
        and also set every first element before first letter to be a place for every apostrophe that comes at second word

        [A'=0] [Aaa=1] [Aab=2] ... [Aay=25] [Aaz=26] ... [B'=677] [Baa=678] --- base 676 number
                n = ( 677*first(word ) 26*second(word) + third(word) ) + 1;
    */

    int hash_value = 0;

    if(strlen(word) <= 1)                                       // if there's only one word calculate that word, store to element after apostrophe
    {
        hash_value = ((677 * (tolower(word[0]) - 97)) + 1);
        return hash_value;
    }
    else if(!isalpha(word[1]))                                  // if second word contain apostrophe just calucalte first word
    {
        hash_value = (677 * (tolower(word[0]) - 97));
        return hash_value;
    }
    else if(strlen(word) <= 2)                                  // if there's two word calculate that two word, store to element after apostrophe
    {
        hash_value = ((677 * (tolower(word[0]) - 97)) + (27 * (tolower(word[1]) - 97))) + 1;
        return hash_value;
    }
    else if(!isalpha(word[2]))                                  // if third word contain apostrophe just calucalte first and two word
    {
        hash_value = ((677 * (tolower(word[0]) - 97)) + (27 * (tolower(word[1]) - 97))) + 1;
        return hash_value;
    }
    else
    {
        hash_value = ((677 * (tolower(word[0]) - 97)) + (27 * (tolower(word[1]) - 97)) + (tolower(word[2]) - 97)) + 1;
        return hash_value;
    }
}

It's actually works quite well.

./speller texts/lalaland.txt
WORDS MISSPELLED:     955
WORDS IN DICTIONARY:  143091
WORDS IN TEXT:        17756
TIME IN load:         0.02
TIME IN check:        0.02
TIME IN size:         0.00
TIME IN unload:       0.00
TIME IN TOTAL:        0.05

I just don't like it the way I used a lot of ELSE..IF condition. So maybe you want to help me with better code (with take first three letters).

Vishnu CS
  • 748
  • 1
  • 10
  • 24

3 Answers3

1

In some ways, you are implementing a for loop with if statements. An idea like

int hash = 0;
for (char *current = word; *current != '\0'; current++) {
    if (current == &word[2]) break;
    hash += hash(*current);
}

might permit you to avoid the if statements, because now the end-of-index handling is repeated in the loop.

Edwin Buck
  • 69,361
  • 7
  • 100
  • 138
  • OP's hash limits to the first 3 `char`. This runs to the end of the potentially long string and is not case insensitive. – chux - Reinstate Monica Sep 27 '20 at 14:30
  • @chux-ReinstateMonica Ideas are not finished programs. Typically, by the time one starts to deal with hashing, one can exit a for loop after three iterations. – Edwin Buck Sep 27 '20 at 14:36
0

I do not know if I inderstand the logic of this hash but:

  1. tolower returns the same char if the parameter in not upper letter. No need to check if it is alpha
unsigned int hash1(const char *word)
{

    unsigned int result;

    result = (!!word[0]) * (677 * (tolower(word[0]) - 97));
    result += (word[0] && word[1]) * (27 * (tolower(word[1]) - 97));
    result += (word[0] && word[1] && word[2]) * (tolower(word[2]) - 97) + 1;

    return result;
}
0___________
  • 60,014
  • 4
  • 34
  • 74
0

Simply test isalpha() and take advantage that isalpha(0) is false.
With only up to 3 char, a few if() suffice.

_Static_assert(isalpha(0) == 0, "isalpha(0) unexpectedly true");

unsigned int hashf(const char *word) {
  // best to use unsigned char access for tolower() to avoid UB.
  const unsigned char *uword = word;

  unsigned hash = 677u * (tolower(uword[0]) - 'a') + 1;
  if (isalpha(uword[0]) && isalpha(uword[1])) {
    hash += 27u * (tolower(uword[1]) - 'a');
    if (isalpha(uword[2])) hash += tolower(uword[2]) - 'a';
  }
  return hash;  // see below
}

Simplification possibility: no real need for - 97 or - 'a'. We are making a hash, but obliges a final modulo.

  #define BUCKET_N 17602u
  return hash % BUCKET_N;

IAC, a final modulo by the bucket size makes for robust hash. Consider OP's hash("~"). OP's code assumes character values - 97 are in the range 0 to 25, yet what happens when a character is á? Watch out for character vales outside [0 CHAR_MAX].

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256