1

What are the best prime numbers to choose for a double hashed hash table size?

side info

  • the hash table is part of a word analysis project: Markov models, training bots to model and generate text as if someone else had written it (which takes a lot of words, sentences, transcripts, books... the bigger the corpus, the better)
  • I'm not familiar with most of the math around prime numbers, but I will read up on everything you propose and go from there

what I have in mind:

  • the primes shouldn't be too close together or too far apart: I don't want to resize frequently, but I also don't want the hash table to end up half empty (fewer collisions; I'm looking for a good balance between load factor and table size)
  • optimal for a big corpus: I'm not sure how big the primes should be, I've never done this before...
  • I also thought of implementing a function (not a hash function) that doubles the size of the hash table and then looks for the closest prime. As I picture it, that has a running time of O(n) per candidate, because a prime is only divisible by 1 and itself: I'd have to check that every number below the candidate (about double the current table size) leaves a nonzero remainder, then move to the next odd number and run the whole loop again. You can imagine that would be very slow, so the better approach seems to be a fixed set of primes up to a million or so (just for illustration) and to use those for every size change
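For reference, the "double, then find the next prime" idea from the last bullet can be sketched as below. This is only a sketch, not anything from the thread: the names `is_prime` and `next_table_size` are made up for illustration, and it assumes Python 3.8+ for `math.isqrt`. It relies on the fact that a composite number always has a divisor no larger than its square root, so each candidate needs about sqrt(n) trial divisions rather than n.

```python
import math

def is_prime(n: int) -> bool:
    """Trial division, testing odd divisors up to sqrt(n) only."""
    if n < 2:
        return False
    if n < 4:
        return True            # 2 and 3 are prime
    if n % 2 == 0:
        return False
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

def next_table_size(current_size: int) -> int:
    """Smallest prime >= 2 * current_size (hypothetical helper)."""
    candidate = 2 * current_size
    while not is_prime(candidate):
        candidate += 1
    return candidate

print(next_table_size(101))
```

Since a resize already touches every stored entry, this search is unlikely to dominate the cost of growing the table, though a precomputed list avoids it altogether.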

Thanks, any additional questions appreciated

templatetypedef
Fero
  • Why do you have to find primes *on the fly*? Why not just precompute a table T of primes where T[i+1] is approximately 2 * T[i] and hard-code that in your program? Also, is there some reason you want to do this yourself, as opposed to simply using a python dict? – President James K. Polk Oct 04 '15 at 15:04
  • I'm actually asking for numbers like that; the function was just another option I explored. As for the reason: I'm learning all of this, and I want to build and explore my own hash table. – Fero Oct 04 '15 at 17:06
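A hard-coded table of the kind suggested in the comment above might look like the following sketch. The specific primes are one commonly circulated roughly-doubling sequence, not values taken from this thread, and `PRIME_SIZES`, `size_for` and `grow` are hypothetical names chosen for illustration.

```python
from bisect import bisect_left, bisect_right

# Each entry is a prime roughly twice the previous one (assumed list,
# extend it if the corpus needs more than ~1.5 million slots).
PRIME_SIZES = (
    53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593,
    49157, 98317, 196613, 393241, 786433, 1572869,
)

def size_for(min_capacity: int) -> int:
    """Smallest precomputed size that is >= min_capacity."""
    i = bisect_left(PRIME_SIZES, min_capacity)
    if i == len(PRIME_SIZES):
        raise ValueError("table exhausted; extend PRIME_SIZES")
    return PRIME_SIZES[i]

def grow(current_size: int) -> int:
    """Next precomputed size strictly larger than current_size."""
    i = bisect_right(PRIME_SIZES, current_size)
    if i == len(PRIME_SIZES):
        raise ValueError("table exhausted; extend PRIME_SIZES")
    return PRIME_SIZES[i]

print(size_for(1000), grow(1543))
```

Both lookups are binary searches over a handful of entries, so choosing a new size costs effectively nothing compared with rehashing the table.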

2 Answers

2

Choose the high member of a twin prime pair, i.e. when p and p - 2 are both prime, use p as the double hash capacity. The reason is that hash_code % (size - 2) is a good secondary step function for the double hashing algorithm, and a modulo by a prime number is somewhat more "robust" than a modulo by a composite number (which is what you get if size - 2 is composite).

For small sizes (up to somewhere around 1000 or so), choose all primes except the low members of twin pairs, because twin pairs are too sparse at the beginning of the natural number scale for good size predictability.

Add sizes 5 and 11 (even though they are the low members of twin pairs) to better cover very small table sizes.

Exclude numbers that are frequently used as multipliers in hash functions. In Java that is 31, which is used in the String hash function; I don't know about Python.

All of the above is carefully coded in this runnable Java class, with a lot of pre-generated table sizes (trying to keep the difference between neighbouring table sizes within 0.005):

https://github.com/OpenHFT/Koloboke/blob/0498951705b45be2e1528afd786c03308c36e5dc/lib/impl/src/main/java/net/openhft/koloboke/collect/impl/hash/DHashCapacities.java#L255-L272
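For readers working in Python, the twin-prime rule can be sketched roughly as follows. This is not a port of the linked Java class: `is_high_twin`, `next_high_twin` and `probe_sequence` are hypothetical names, and adding 1 to the step so that it can never be zero is an assumption of this sketch, not something stated above.

```python
import math

def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    return all(n % d for d in range(3, math.isqrt(n) + 1, 2))

def is_high_twin(p: int) -> bool:
    """True when both p and p - 2 are prime."""
    return is_prime(p) and is_prime(p - 2)

def next_high_twin(minimum: int) -> int:
    """Smallest high twin prime >= minimum (5 is the smallest one)."""
    p = max(minimum, 5)
    while not is_high_twin(p):
        p += 1
    return p

def probe_sequence(hash_code: int, size: int):
    """Yield slot indices in double-hashing order for a prime size."""
    index = hash_code % size
    # Assumption: +1 keeps the step in 1..size-2, so it is never 0.
    step = 1 + hash_code % (size - 2)
    for _ in range(size):
        yield index
        index = (index + step) % size

size = next_high_twin(100)
print(size, list(probe_sequence(12345, size))[:5])
```

Because the size is prime and the step is never a multiple of it, the sequence visits every slot exactly once before repeating.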

P.S. My personal belief is that double hashing is never the optimal flavor of open addressing, because modulo operations are disproportionately expensive on modern CPUs. Consider using QHash.

leventov
1

Not sure if I fully understand your question, but here is a possible solution from the Java world. I understand why you need prime numbers in general if you have to write a hash function from scratch, but I'm not sure you will need to research them if a "good" hash function like this is used.

Hope this helps!

Prashant
  • A double-hashed hash table requires a prime-sized table, so I need a set of numbers to grow through incrementally. Example: the size is 101, the load factor reaches 2/3 and I want to increase the size, so I go to the prime closest to double the current size, e.g. 211. Thanks though ^^ valuable resource – Fero Oct 03 '15 at 02:41