Do you have any simple hash function that takes first three letter?

Question

I'm working at cs50 speller. This is my hash function. It works well. But I think it's not really good because there's a lot of if condition. I try to make a different table for every word that have apostrophe at the first two letter (ex, A', B', C').

So I used buckets (26 x 26 x 26) + 26 = 17602;

unsigned int hash(const char *word)
{
    /*
        this is HASHING algorithms in which we return value with checking first three letter
        make sure every word set to lowercase so we can easily convert as integer value or alphabetical index
        and also set every first element before first letter to be a place for every apostrophe that comes at second word

        [A'=0] [Aaa=1] [Aab=2] ... [Aay=25] [Aaz=26] ... [B'=677] [Baa=678] --- base 676 number
                n = ( 677*first(word ) 26*second(word) + third(word) ) + 1;
    */

    int hash_value = 0;

    if(strlen(word) <= 1)                                       // if there's only one word calculate that word, store to element after apostrophe
    {
        hash_value = ((677 * (tolower(word[0]) - 97)) + 1);
        return hash_value;
    }
    else if(!isalpha(word[1]))                                  // if second word contain apostrophe just calucalte first word
    {
        hash_value = (677 * (tolower(word[0]) - 97));
        return hash_value;
    }
    else if(strlen(word) <= 2)                                  // if there's two word calculate that two word, store to element after apostrophe
    {
        hash_value = ((677 * (tolower(word[0]) - 97)) + (27 * (tolower(word[1]) - 97))) + 1;
        return hash_value;
    }
    else if(!isalpha(word[2]))                                  // if third word contain apostrophe just calucalte first and two word
    {
        hash_value = ((677 * (tolower(word[0]) - 97)) + (27 * (tolower(word[1]) - 97))) + 1;
        return hash_value;
    }
    else
    {
        hash_value = ((677 * (tolower(word[0]) - 97)) + (27 * (tolower(word[1]) - 97)) + (tolower(word[2]) - 97)) + 1;
        return hash_value;
    }
}

It's actually works quite well.

./speller texts/lalaland.txt
WORDS MISSPELLED:     955
WORDS IN DICTIONARY:  143091
WORDS IN TEXT:        17756
TIME IN load:         0.02
TIME IN check:        0.02
TIME IN size:         0.00
TIME IN unload:       0.00
TIME IN TOTAL:        0.05

I just don't like it the way I used a lot of ELSE..IF condition. So maybe you want to help me with better code (with take first three letters).

you can take a pointer to first char, use a while loop and increment pointer. — Ôrel, Sep 27 '20 at 10:15
Kevinkun: Important specification needed: Should the hash be case insensitive? It looks like you want that, but best to be clear. — chux - Reinstate Monica, Sep 27 '20 at 14:25
In 2nd `if`, code has `hash_value = (677 * (tolower(word[0]) - 97));`. Is the +1 missing? — chux - Reinstate Monica, Sep 27 '20 at 15:09
store `strlen(word)` in a variable instead of calling it again and again like that — phuclv, Sep 27 '20 at 15:22

score 1 · Answer 1 · answered Sep 27 '20 at 14:22

1

In some ways, you are implementing a for loop with if statements. An idea like

int hash = 0;
for (char *current = word; *current != '\0'; current++) {
    if (current == &word[2]) break;
    hash += hash(*current);
}

might permit you to avoid the if statements, because now the end-of-index handling is repeated in the loop.

answered Sep 27 '20 at 14:22

Edwin Buck

69,361
7
100
138

OP's hash limits to the first 3 `char`. This runs to the end of the potentially long string and is not case insensitive. – chux - Reinstate Monica Sep 27 '20 at 14:30
@chux-ReinstateMonica Ideas are not finished programs. Typically, by the time one starts to deal with hashing, one can exit a for loop after three iterations. – Edwin Buck Sep 27 '20 at 14:36

score 0 · Answer 2 · answered Sep 27 '20 at 10:52

0

I do not know if I inderstand the logic of this hash but:

tolower returns the same char if the parameter in not upper letter. No need to check if it is alpha

unsigned int hash1(const char *word)
{

    unsigned int result;

    result = (!!word[0]) * (677 * (tolower(word[0]) - 97));
    result += (word[0] && word[1]) * (27 * (tolower(word[1]) - 97));
    result += (word[0] && word[1] && word[2]) * (tolower(word[2]) - 97) + 1;

    return result;
}

answered Sep 27 '20 at 10:52

0___________

60,014
4
34
74

this answer only works if the first three letters are in the range a...z and/or A...Z For a word like the monsanto brand `3m` the result is a LARGE NEGATIVE hash – user3629249 Sep 27 '20 at 18:37
@user3629249 it cant be negative – 0___________ Sep 27 '20 at 18:41
using the example `3m` the result is: `1 * 677 * (51-97)` = −31142 – user3629249 Sep 27 '20 at 18:46
I do not use **`signed`** integers and result cannot be negative – 0___________ Sep 27 '20 at 18:47

chux - Reinstate Monica · Answer 3 · 2020-09-27T15:06:34.563

Simply test isalpha() and take advantage that isalpha(0) is false.
With only up to 3 char, a few if() suffice.

_Static_assert(isalpha(0) == 0, "isalpha(0) unexpectedly true");

unsigned int hashf(const char *word) {
  // best to use unsigned char access for tolower() to avoid UB.
  const unsigned char *uword = word;

  unsigned hash = 677u * (tolower(uword[0]) - 'a') + 1;
  if (isalpha(uword[0]) && isalpha(uword[1])) {
    hash += 27u * (tolower(uword[1]) - 'a');
    if (isalpha(uword[2])) hash += tolower(uword[2]) - 'a';
  }
  return hash;  // see below
}

Simplification possibility: no real need for - 97 or - 'a'. We are making a hash, but obliges a final modulo.

  #define BUCKET_N 17602u
  return hash % BUCKET_N;

IAC, a final modulo by the bucket size makes for robust hash. Consider OP's hash("~"). OP's code assumes character values - 97 are in the range 0 to 25, yet what happens when a character is á? Watch out for character vales outside [0 CHAR_MAX].

Do you have any simple hash function that takes first three letter?

3 Answers3