I'm trying to write a (perfect) hash table for compressing the mapping from Unicode codepoint names to their codepoint numbers (mapping the second column to the first column). As you can see there, the possible inputs are very restricted; in fact there are exactly 38 characters in the alphabet: the letters A through Z, the digits 0 through 9, the hyphen, and the space. Furthermore, there is a lot of (substring) repetition: DIGIT ZERO, DIGIT ONE, ..., LATIN CAPITAL LETTER A, LATIN CAPITAL LETTER B, and so on.
The perfect hash table is computed by choosing a seed S and then trying to construct the table with the hasher seeded (in some way) by S. If a table can't be made, the process retries with a new seed. Having a lot of collisions generally requires more retries, because it's harder for the algorithm to make everything fit.
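For concreteness, here is a minimal sketch of that retry loop, assuming the simplest possible scheme: hash every key with a seeded FNV-1a and accept the seed only if no two keys land in the same slot. The seeding scheme, table size, and hasher here are placeholders of my own, not the real builder:

```rust
use std::collections::HashSet;

/// FNV-1a, seeded by mixing the seed into the initial state.
/// (Hypothetical seeding scheme, purely for illustration.)
fn fnv1a_seeded(seed: u64, key: &str) -> u64 {
    let mut h = 0xcbf29ce484222325u64 ^ seed;
    for &b in key.as_bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Try seeds until one maps every key to a distinct slot in a table of
/// `table_size` entries, i.e. until the hash is perfect for this key set.
fn find_perfect_seed(keys: &[&str], table_size: u64, max_tries: u64) -> Option<u64> {
    'seeds: for seed in 0..max_tries {
        let mut used = HashSet::with_capacity(keys.len());
        for &key in keys {
            let slot = fnv1a_seeded(seed, key) % table_size;
            if !used.insert(slot) {
                continue 'seeds; // collision: give up on this seed and retry
            }
        }
        return Some(seed);
    }
    None
}

fn main() {
    let keys = ["DIGIT ZERO", "DIGIT ONE", "LATIN CAPITAL LETTER A", "HYPHEN-MINUS"];
    // A sparser table (more slots than keys) makes a perfect seed easier to find.
    match find_perfect_seed(&keys, 8, 1_000_000) {
        Some(s) => println!("perfect seed: {s}"),
        None => println!("no perfect seed found"),
    }
}
```

The real construction is more elaborate than this, but the cost model is the same: every rejected seed means re-hashing the whole key set, so a hasher that collides less on these particular inputs directly cuts the precomputation time.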
The upshot is that my input domain has low entropy, and table creation requires a lot of retries with simple hash functions like DJB2; better hashers like FNV work tolerably well, while more complicated and slower functions like SipHash seem to require even fewer retries on average.
Since this is entirely static and precomputed, I'm not too worried about quality for quality's sake (i.e. security and probabilistic distribution for arbitrary runtime input don't matter), but the higher-quality functions reduce the precomputation time required for a given level of compression, or, conversely, let me achieve higher compression in a fixed amount of time.
Question: are there efficient published hash functions tuned to input with domain constraints like this? That is, are there hash functions which exploit the extra structure to do fewer operations but still achieve reasonable output?
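To make that concrete, here is the flavour of thing I imagine (a sketch of my own, not something I've found published): since the alphabet has only 38 symbols, each character fits in 6 bits, so roughly ten characters can be packed into one 64-bit word and mixed with a single multiply, instead of running the mixing step once per input byte. The packing table and the multiply-xor mixer below are my own assumptions:

```rust
// Purely illustrative: exploit the 38-symbol alphabet by packing six bits
// per character into 64-bit words, then mixing once per word rather than
// once per byte.
fn pack6(c: u8) -> u64 {
    match c {
        b'A'..=b'Z' => (c - b'A') as u64,      // 0..=25
        b'0'..=b'9' => (c - b'0') as u64 + 26, // 26..=35
        b'-' => 36,
        b' ' => 37,
        _ => 38, // shouldn't happen for valid codepoint names
    }
}

fn packed_hash(seed: u64, name: &str) -> u64 {
    let mut h = seed;
    for chunk in name.as_bytes().chunks(10) {
        // 10 characters * 6 bits = 60 bits per word.
        let mut word = 0u64;
        for &c in chunk {
            word = (word << 6) | pack6(c);
        }
        // One multiply-xor mix per ten characters instead of per byte.
        h = (h ^ word).wrapping_mul(0x9E3779B97F4A7C15);
        h ^= h >> 29;
    }
    h
}

fn main() {
    println!("{:016x}", packed_hash(0x12345678, "LATIN CAPITAL LETTER A"));
}
```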
I have searched for things like 'alphanumeric hash function', but the results are unrelated (mostly just generating an alphanumeric string as the output of a hash function); even some guidance about the correct jargon so that I can search for papers would be helpful.
(This question is motivated by being slightly interesting to solve, not actually necessary.)