
The problem

We have a set of symbol sequences that should be mapped to a pre-defined number of bucket indexes.

Prerequisites

The symbol sequences are restricted in length (at most 64 characters/bytes), and the hash algorithm used is the Delphi implementation of the Bob Jenkins hash, producing a 32-bit hash value.

To further distribute these hash values over a certain number of buckets we use the formula:

  • bucket_number := (hashvalue mod (num_buckets - 2)) + 2;
    (We don't want {0,1} to be in the result set; a small sketch of the mapping follows below.)
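
For illustration, here is a minimal Delphi sketch of that mapping. It assumes the THashBobJenkins class from the System.Hash unit is the Jenkins implementation in use, and NUM_BUCKETS is just an example value, not anything prescribed by our setup:

    uses
      System.Hash;

    const
      NUM_BUCKETS = 1024; // example value only

    // Maps a symbol sequence to a bucket number in [2, NUM_BUCKETS - 1].
    function BucketFor(const SymbolSequence: string): Cardinal;
    var
      HashValue: Cardinal;
    begin
      // GetHashValue returns a signed Integer; reinterpret it as an
      // unsigned 32-bit value before taking the modulus.
      HashValue := Cardinal(THashBobJenkins.GetHashValue(SymbolSequence));
      Result := (HashValue mod (NUM_BUCKETS - 2)) + 2; // {0,1} excluded
    end;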

The question

A colleague raised the concern that we need to choose a prime number for num_buckets to achieve an optimal1 distribution when mapping the symbol sequences to the bucket_numbers.

The majority of the team believes that's an unproven assumption, but our teammate simply claimed it's mathematically intrinsic (without any more in-depth explanation).

I can imagine that certain symbol-sequence patterns we use (just a very limited subset of what's actually allowed) may favor certain hash values, but generally I don't believe that's significant for a large number of symbol sequences.
The hash algorithm should already distribute the hash values optimally, and I doubt that a prime mod divisor would really make a significant difference (I couldn't measure one empirically either), especially since the Bob Jenkins hash calculation doesn't involve any prime numbers itself, as far as I can see.
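
In case anyone wants to reproduce the measurement, here is a rough sketch of how one could count bucket hits (again assuming THashBobJenkins from System.Hash; the generated 'SYM_...' inputs are purely hypothetical stand-ins for our real symbol sequences):

    uses
      System.SysUtils, System.Hash;

    // Counts how many of NumSequences test inputs land in each bucket,
    // so that prime and non-prime divisors can be compared side by side.
    procedure MeasureDistribution(NumBuckets, NumSequences: Integer);
    var
      Counts: array of Integer;
      I, Bucket: Integer;
      Seq: string;
    begin
      SetLength(Counts, NumBuckets); // zero-initialized by SetLength
      for I := 1 to NumSequences do
      begin
        Seq := 'SYM_' + I.ToString; // hypothetical test input
        Bucket := Integer(Cardinal(THashBobJenkins.GetHashValue(Seq))
                    mod Cardinal(NumBuckets - 2)) + 2;
        Inc(Counts[Bucket]);
      end;
      for I := 2 to NumBuckets - 1 do
        Writeln(Format('bucket %4d: %d', [I, Counts[I]]));
    end;

Comparing, say, MeasureDistribution(1021, 1000000) (prime) against MeasureDistribution(1024, 1000000) should show per-bucket counts that differ only by random fluctuation, not systematically.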

[TL;DR]
Does a prime number mod divisor matter for this case, or not?


1) optimal simply means a stable average value of number-of-sequences per bucket, which doesn't change (much) with the total number of sequences

πάντα ῥεῖ
  • If it mattered, you'd want num_buckets - 2 to be prime. But based on your description I don't think it matters. – President James K. Polk May 17 '22 at 20:06
  • The fact that your colleague can't justify the claim says a lot – David Heffernan May 17 '22 at 21:45
  • Note that if the symbol sequences can be end user supplied then you probably need to use a hash function that is secure against attack. Usually that means being secure against an attacker picking values that cause worst case running time and/or memory usage. – Brian May 17 '22 at 21:46
  • *optimal distribution* What is this? I remember my first course related to optimization (OK, about 45 years ago): "no criteria, no optimization". The first step is to clearly define the criteria. During my career, I have often heard *this is optimal* or even better *everyone knows it is optimal*. And when I asked about the criteria, I generally got *but, this is evident!*. In your case, I tried to find a criterion for which a prime would be optimal... I could not find one. Can your colleague detail such a criterion? – Damien May 18 '22 at 08:22
  • @Damien Well, _optimal_ simply means a stable average value of number-of-sequences per bucket, which doesn't change (much) with the total number of sequences (that's my definition of the criterion). – πάντα ῥεῖ May 18 '22 at 10:20
  • @πάνταῥεῖ In this case, I cannot really think how primality could help in any manner. Maximizing the number of buckets helps a little bit, as you mentioned. The two processes, hashing and reduction, must be uncorrelated, that is enough. – Damien May 18 '22 at 10:31
  • @Damien _"In this case, I cannot really think how primality could help in any manner."_ Me neither :-) – πάντα ῥεῖ May 18 '22 at 10:36
  • Reading about symbol references that are likely to be strings (given that you wrote "64 characters/bytes") makes me wonder whether a hash table involving a hash function (and a rather slow one, I have to add) is the best way, and whether it would not be better to simply use the first character as an index into a 2-dim array, using linear search on the 2nd dimension, which only contains the symbol references starting with the same letter. – Stefan Glienke May 18 '22 at 12:46
  • @StefanGlienke no, that won't really help. It's not about efficient memory representation, or fast searching in this specific case. Just stable mapping of symbol-sequences to bucket-numbers. – πάντα ῥεῖ May 18 '22 at 15:00
  • I can guarantee you that the "use first char as bucket index and then linear search the entries in that bucket" approach blows a hash table using Bob Jenkins out of the water. Given that the number of entries in each bucket is not too high, of course. You wrote "symbol references", so I have something like keywords in a parser/lexer in mind - it could be something different though, where the number of elements is very large - at some point linear search falls behind. As for the "stable bucket number mapping", I have the slight feeling of an XY problem - maybe elaborate on that reason. – Stefan Glienke May 18 '22 at 15:28
  • @StefanGlienke _"You wrote "symbol references" so I have something like keywords in a parser/lexer in mind"_ no, it's a way simpler problem, and no searching is applied, as I mentioned. We just need to guarantee that the same symbol sequences (i.e. strings) produce the same bucket number, and no bucket numbers are preferred by the algorithm. – πάντα ῥεῖ May 18 '22 at 16:55
  • @StefanGlienke nice to meet you here BTW, IIRC I am using some of your public stuff. You're a celebrity :-) – πάντα ῥεῖ May 18 '22 at 18:28

1 Answer


Your colleague is simply wrong.

If a hash works well, all hash values should be equally likely, with no obvious relationship to the input data.

When you take the hash mod some value, you are mapping equally likely hash inputs onto a reduced number of output buckets. The result is no longer perfectly evenly distributed, to the extent that different outputs can be produced by different numbers of inputs. As long as the number of buckets is small relative to the range of hash values, this discrepancy is small: it is on the order of (# of buckets) / (# of hash values). The number of buckets is typically under 10^6, and even a 32-bit hash such as yours has 2^32 (more than 4·10^9) possible values, so this is very small indeed. But if the number of buckets exactly divides the range of hash values, there is no discrepancy at all.
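
To make that concrete, here is a small sketch (names are illustrative) that counts exactly how many of the Range possible hash values map to a given bucket under a plain v mod NumBuckets reduction:

    // Number of values v in [0, Range) with v mod NumBuckets = Remainder.
    // Each bucket receives either Range div NumBuckets inputs or that
    // plus one, so any two buckets differ by at most a single input.
    function InputsForBucket(Range, NumBuckets, Remainder: UInt64): UInt64;
    begin
      Result := Range div NumBuckets;
      if Remainder < Range mod NumBuckets then
        Inc(Result);
    end;

For Range = 2^32 and NumBuckets = 1000, buckets 0..295 receive 4294968 inputs each and buckets 296..999 receive 4294967, a relative imbalance of roughly 2·10^-7.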

Primality doesn't enter into it, except insofar as you get the best distribution when the number of buckets divides the range of the hash function. Since the range of the hash function is usually a power of 2, a prime number of buckets is unlikely to do anything for you.

btilly