
In case you are not familiar with universal hashing, it's mainly an attempt to guarantee a low number of collisions (as opposed to, say, using plain old modulo), using some rather simple math involving randomness. The problem is that it doesn't work for me:

size_t hash_modulo(const int value) {
    return (size_t) (value % TABLE_SIZE);
}

// prime 491 is used because it's > 128, which is the size of the hash table
size_t hash_universal(const int value) {
    // a must be in [1, 490]: rand() % 491 + 1 could yield 491, which is 0 mod 491
    const size_t a = (size_t) (rand() % 490 + 1);
    const size_t b = (size_t) (rand() % 491);
    //printf("a: %zu, b:%zu\n", a, b);
    return ((a * value + b) % 491) % TABLE_SIZE;
}

I test modulo hashing first and determine the longest chain length (the chain length is the number of entries in a hash bucket):

size_t get_max_chain_length(int input[TABLE_SIZE], size_t (*hash_function)(const int)) {
    HashTable *hash_table = hash_table_create(hash_function);
    if (!hash_table) {
        return 0;
    }

    for (size_t i = 0; i < TABLE_SIZE; ++i) {
        hash_table_add(hash_table, input[i]);
    }

    size_t maximum_chain_length = 0;
    for (int j = 0; j < TABLE_SIZE; ++j) {
        const size_t length = length_of_(hash_table->rows[j]);
        maximum_chain_length = (length > maximum_chain_length) ? length : maximum_chain_length;
    }

    //hash_table_print(hash_table);
    hash_table_destroy(hash_table);

    return maximum_chain_length;
}

I pick one of the inputs which led to a really long chain (i.e. one which performs badly using plain modulo) and throw it at universal hashing. Universal hashing uses randomness, so I can take a constant input and still get varying results.

And here comes the problem. I try 100 random input arrays of size 128 each and calculate the average longest chain and the overall longest chain, but both algorithms perform similarly.

You can check my main in my repo.

My question is: Is that result to be expected? Does universal hashing not perform any better on input which already performed poorly using modulo? Or did I just screw up my implementation (more likely)?

Thanks a lot in advance!

AdHominem
  • Wait, you're recomputing `a` and `b` for every single hash access? How does that make sense? – melpomene Dec 23 '16 at 09:35
  • were `a` and `b` supposed to be `static` in this attempt? – WhozCraig Dec 23 '16 at 09:51
  • @melpomene: If they were static, the function would always hash the same input into the same bucket? – AdHominem Dec 23 '16 at 10:21
  • @WhozCraig see above – AdHominem Dec 23 '16 at 10:33
  • @AdHominem Reading your linked article, it makes more sense now. Thanks. – WhozCraig Dec 23 '16 at 10:34
  • @AdHominem How does your approach retrieve the stored elements? – 2501 Dec 23 '16 at 10:36
  • @2501 Elements are stored in an array (so they can be referenced later). You can see the implementation in my repo link. – AdHominem Dec 23 '16 at 11:02
  • @WhozCraig It is still unclear whether the numbers are chosen at random once for each hash table or once for each insertion function call. My tests suggest it doesn't have any impact. – AdHominem Dec 23 '16 at 11:03
  • Your implementation doesn't have any get functions. All it can do is iterate through all of the elements. – 2501 Dec 23 '16 at 11:06
  • @2501 Yes, what would I need a get function for? All I want is to store the inputs temporarily with their longest chain to compare them and determine a longest. – AdHominem Dec 23 '16 at 11:09
  • Because there is no way of actually implementing a usable hash table with your approach. – 2501 Dec 23 '16 at 11:11
  • @2501 I understand your point now. Implementing such a method is not related to the efficiency problem, though. I was just asking why the maximum bucket size does not seem to change a lot using either algorithm, but I figured the only difference is when you insert really uniform input. – AdHominem Dec 23 '16 at 11:51

2 Answers


Well, why do you think modulo is bad? If the input is random and sufficiently large, modulo should yield a uniformly distributed result. Universal hashing (as your link states) provides protection against non-random (i.e., malicious) input, which isn't the case here.
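To see a difference, the input has to be structured rather than random. Here is a minimal sketch, not taken from your repo: the names `UniversalHash`, `universal_pick`, `universal_hash`, `fill_structured_keys` and `max_chain` are hypothetical, and I assume a prime of 2^31 - 1 instead of your 491 so that every key stays below the prime. The keys are the multiples of 128, which plain modulo sends to a single bucket, and `a` and `b` are drawn once per table rather than once per call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE 128
/* Assumption: a prime above every key (2^31 - 1 here, not the question's 491). */
#define PRIME 2147483647ULL

/* One member of the family h(x) = ((a*x + b) mod PRIME) mod TABLE_SIZE. */
typedef struct {
    uint64_t a; /* 1 <= a < PRIME */
    uint64_t b; /* 0 <= b < PRIME */
} UniversalHash;

/* rand() may return as few as 15 random bits, so glue four calls together. */
uint64_t rand_bits60(void) {
    uint64_t r = 0;
    for (int i = 0; i < 4; ++i) {
        r = (r << 15) | ((uint64_t) rand() & 0x7FFF);
    }
    return r;
}

/* Draw a and b once per table; every key is then hashed with the same pair. */
UniversalHash universal_pick(void) {
    UniversalHash h;
    h.a = rand_bits60() % (PRIME - 1) + 1;
    h.b = rand_bits60() % PRIME;
    return h;
}

size_t universal_hash(const UniversalHash *h, uint64_t key) {
    assert(key < PRIME); /* a*key + b stays below 2^63, so no overflow */
    return (size_t) (((h->a * key + h->b) % PRIME) % TABLE_SIZE);
}

/* Structured input: every key is a multiple of TABLE_SIZE. */
void fill_structured_keys(uint64_t keys[TABLE_SIZE]) {
    for (size_t i = 0; i < TABLE_SIZE; ++i) {
        keys[i] = (uint64_t) i * TABLE_SIZE;
    }
}

/* Longest chain for the keys; h == NULL means plain key % TABLE_SIZE. */
size_t max_chain(const uint64_t *keys, size_t n, const UniversalHash *h) {
    size_t counts[TABLE_SIZE] = {0};
    size_t longest = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t bucket = h ? universal_hash(h, keys[i])
                          : (size_t) (keys[i] % TABLE_SIZE);
        if (++counts[bucket] > longest) {
            longest = counts[bucket];
        }
    }
    return longest;
}
```

With plain modulo, `max_chain(keys, TABLE_SIZE, NULL)` is 128 because every key lands in bucket 0. A pair from `universal_pick()` should typically spread these keys, but not always: `a = 1, b = 0` reproduces the modulo result exactly, which is why the protection is only probabilistic.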

SomeWittyUsername
  • Well, that's why I take the worst possible distribution from modulo, to check if it would look any better using universal hashing. Is that methodology flawed? – AdHominem Dec 23 '16 at 10:09
  • What is the worst possible distribution? If the input is large enough, any random input should converge to uniform. – SomeWittyUsername Dec 23 '16 at 10:15
  • Hmm, maybe just run the code once. I generate random input and choose one which leads to a big bucket using modulo. Then I use exactly the same input with universal to check if the result improves. And according to the specification, the maximum chain length should at least reduce a bit. – AdHominem Dec 23 '16 at 10:29

In case you are not familiar with universal hashing, it's mainly an attempt to guarantee a low number of collisions …

An "attempt to guarantee" is no guarantee.

… (as opposed to, say, using plain old modulo)…

The linked article says that even simpler hash functions [using modulo] are approximately universal.

I generate random input and choose one which leads to a big bucket using modulo. Then I use exactly the same input with universal to check if the result improves. And according to the specification, the maximum chain length should at least reduce a bit.

100 random input arrays of size 128 each aren't a particularly large input. I ran the program from your repo eight times; in five of those runs, your universal hashing reduced the "Average maximum chain length" by around 10 %; in three runs, your universal hashing increased the "Average maximum chain length" by a similar amount. I notice that the maximum chain length with universal hashing is constant within each run.

To sum up, there's no guarantee that one hash method is always better than another, and your universal hashing does seem to keep its performance promise by being better more often than not.
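What the family does promise is a bound on how often two fixed, distinct keys collide when `a` and `b` are chosen at random, not a bound on the longest chain. As a rough sketch, this can be checked exhaustively for the question's parameters (prime 491, table size 128), assuming both keys are below 491. The function names `hash_ab`, `colliding_pairs` and `family_size` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 128
#define PRIME 491

/* h_{a,b}(x) = ((a*x + b) mod PRIME) mod TABLE_SIZE, for keys 0 <= x < PRIME. */
size_t hash_ab(size_t a, size_t b, size_t x) {
    assert(x < PRIME);
    return ((a * x + b) % PRIME) % TABLE_SIZE;
}

/* Count the (a, b) pairs under which the keys x and y share a bucket. */
size_t colliding_pairs(size_t x, size_t y) {
    size_t count = 0;
    for (size_t a = 1; a < PRIME; ++a) {      /* a in [1, PRIME - 1] */
        for (size_t b = 0; b < PRIME; ++b) {  /* b in [0, PRIME - 1] */
            if (hash_ab(a, b, x) == hash_ab(a, b, y)) {
                ++count;
            }
        }
    }
    return count;
}

/* Total number of (a, b) pairs in the family. */
size_t family_size(void) {
    return (size_t) (PRIME - 1) * PRIME;
}
```

For the keys 0 and 128, which always collide under plain modulo, `colliding_pairs(0, 128)` is 1410 out of `family_size()` = 240590 pairs, i.e. roughly 0.6 %, which is below 1/128 ≈ 0.78 %. That bound holds per pair of keys; it says nothing directly about the maximum chain length in a single run.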

Armali