
I'm trying to write a (perfect) hash table for compressing the mapping from Unicode codepoint names to their codepoint numbers (mapping the second column to the first column). The possible inputs are very restricted: there are exactly 38 characters in the alphabet: A-Z, 0-9, hyphen and space. Furthermore, there is a lot of (substring) repetition: DIGIT ZERO, DIGIT ONE, ..., LATIN CAPITAL LETTER A, LATIN CAPITAL LETTER B, etc.

The perfect hash table is computed by choosing a seed S and then trying to construct the table, seeding the hasher (in some way) with S. If a table can't be built, the process retries with a new seed. Lots of collisions generally mean more retries, because it's harder for the algorithm to make everything fit.
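For concreteness, a minimal sketch of that retry loop; `try_build_phf` is a hypothetical stand-in for whatever construction pass is actually used, and the seed sequence is arbitrary:

```rust
// Hypothetical construction step: returns Some(table) if the seed works,
// None if the seed leads to unresolvable collisions. The real pass is elided.
fn try_build_phf(keys: &[&str], seed: u64) -> Option<Vec<u32>> {
    let _ = (keys, seed);
    None
}

// Keep picking seeds until a perfect table can be built (or we give up).
fn build_table(keys: &[&str], max_attempts: u64) -> Option<(u64, Vec<u32>)> {
    for attempt in 0..max_attempts {
        // Any deterministic or random seed sequence works; this one is arbitrary.
        let seed = attempt.wrapping_mul(0x9e37_79b9_7f4a_7c15).wrapping_add(1);
        if let Some(table) = try_build_phf(keys, seed) {
            return Some((seed, table));
        }
        // A collision-prone hash function makes this loop run many more times.
    }
    None
}
```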

The upshot is that my input domain has low entropy, and table creation requires a lot of retries with simple hash functions like DJB2; better hashers like FNV work tolerably well, while more complicated and slower functions like SipHash seem to require even fewer retries on average.
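For reference, here are minimal seeded versions of the two simpler hashers mentioned; folding the seed into the initial state is just one assumed way of mixing S in, not necessarily how my hasher does it:

```rust
// DJB2 (xor variant): h = h * 33 ^ byte, started from the seed instead of 5381.
fn djb2(seed: u32, key: &[u8]) -> u32 {
    let mut h = seed ^ 5381;
    for &b in key {
        h = h.wrapping_mul(33) ^ b as u32;
    }
    h
}

// FNV-1a (64-bit): xor the byte in, then multiply by the FNV prime.
fn fnv1a(seed: u64, key: &[u8]) -> u64 {
    let mut h = seed ^ 0xcbf2_9ce4_8422_2325; // FNV-1a offset basis
    for &b in key {
        h ^= b as u64;
        h = h.wrapping_mul(0x100_0000_01b3); // FNV 64-bit prime
    }
    h
}
```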

Since this is entirely static and precomputed, I'm not too worried about quality for quality's sake (i.e. security and distribution quality on arbitrary runtime input don't matter), but higher-quality functions reduce the precomputation time required for a given level of compression, or, conversely, let me achieve higher compression in a fixed amount of time.

Question: are there efficient published hash functions tuned to input with domain constraints like this? That is, are there hash functions which exploit the extra structure to do fewer operations but still achieve reasonable output?

I have searched for things like 'alphanumeric hash function', but the results are unrelated (mostly just generating an alphanumeric string as the output of a hash function); even some guidance about the correct jargon so that I can search for papers would be helpful.

(This question is motivated by being slightly interesting to solve, not actually necessary.)

huon
  • You want a perfect hash for 27268 items? Seems hard to me. Why not just use a *standard* hash function and handle the collisions? (And maybe use a low fill factor.) – wildplasser Oct 04 '14 at 11:32
  • @wildplasser it works fine, it can just take a little while to generate. E.g. [this array](https://github.com/huonw/unicode_names/blob/1f331f78201b914604346e1d6fc3e9b3b2eda772/src/generated_phf.rs#L771) is the hash table itself: use the hash of the input string as an index into that table and then verify it's correct (a rough sketch of this lookup follows the comments). The point of this question is exploiting the structure of the input to be faster, by doing as little work as possible. Also, this is for compression, so a low load factor is not good. – huon Oct 04 '14 at 12:04
  • @wildplasser Lastly, note that I am currently using a standard hash function (I actually mention three in the question). – huon Oct 04 '14 at 12:05
  • The lower cardinality of your input alphabet (~5.5 bits) does not really matter, as long as there is enough avalanche effect in the hash function. The (invariant) two leading zeros should not confuse the "standard" hash functions. I just checked, and my own shift-and-xor hash function works just as well on the 38-character alphabet as on the full ~7-bit ASCII set. Question: are you optimising for space, or for speed? (An overflow chain needs only one pointer + one bit per entry.) – wildplasser Oct 04 '14 at 12:18
  • @wildplasser Optimising for space, but that is done by optimising for speed. The bottleneck here is generating the table, which consists of randomly choosing a seed for the hash and trying to construct a hash-table which works. For high compression factors, it may require trying many seeds because they often don't work, and this can take a long time. If a hash function can be more efficient by exploiting the structure of the input, I will be able to compress the table more, in a shorter time. (The question is mostly out of interest, since I can always just throw CPU time at it.) – huon Oct 04 '14 at 12:27
  • @wildplasser That said, runtime efficiency is important too, and collisions are expensive (checking if an entry is correct requires some rather fancy operations) and complicate the table non-trivially: it's much easier to just have a perfect hash. – huon Oct 04 '14 at 12:29
  • Is generating the table done at compile time, or at program initialisation time? BTW: what do you mean by "compression factor"? BTW: collisions are *not* expensive. Normally, the strings will differ in the first character. If there is a common prefix and you have the space, you could store the full (32-bit) hash *in the entry*. (It will probably be relatively easy to compose a collision-free 32-bit hash for N=16K entries.) – wildplasser Oct 04 '14 at 12:30
  • @wildplasser It is done at compile time (hence the question being out of interest). The PHF stores [`O(number of entries)` bytes of metadata](https://github.com/huonw/unicode_names/blob/1f331f782/src/generated_phf.rs#L3-L770); the constant factor can be reduced (i.e. achieve a higher compression) at the cost of making generation harder. The algorithm is described in ["Hash, Displace, Compress" Belazzougui et al. 2009](http://cmph.sourceforge.net/papers/esa09.pdf); you can also look at [my implementation](https://github.com/huonw/unicode_names/blob/1f331f782/generator/phf.rs), FWIW. – huon Oct 04 '14 at 12:36
  • @wildplasser, collisions *are* expensive. The process for finding even the first byte given a hash is somewhat expensive. There's essentially no extra metadata I can store; even a single extra byte per entry increases the storage required by 15-20%. Again, note that I can **easily** create collision-free hash tables; it is just pushing the limits of the compression that is "problematic". (e.g. the one currently in the source linked above is collision free.) – huon Oct 04 '14 at 12:52
  • @wildplasser also, this data is somewhat surprising: 17% of the entries begin with `C`, and 50% of the entries begin with one of `C`, `M`, `L`, `B` and `E`; the 4-letter prefixes of 20% of the names fall into a set of 5 possibilities: `ARAB`, `YI S`, `CJK `, `LATI`, `EGYP` (there are 1083 unique 4-letter prefixes in the data set). *And* there are a lot of entries that start with long words like `EGYPTIAN`, `CUNEIFORM` and `MATHEMATICAL` (about 1000 for each of those). – huon Oct 04 '14 at 12:57
  • Which probably means that in the final string compare, more than 4 characters will be needed to detect a difference in only about ten percent of cases. BTW: the average string length is 25. – wildplasser Oct 04 '14 at 13:22
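For concreteness, here is a rough sketch of the two-level lookup discussed in the comments above, in the spirit of the hash-displace-compress scheme linked there. How the hash output is split and the exact displacement formula are illustrative assumptions, not the linked implementation; `fnv1a` is the seeded hasher sketched earlier:

```rust
// Per-key lookup against a precomputed PHF: one hash is split into three
// parts; the first picks a displacement pair (the per-bucket metadata), which
// then perturbs the other two parts to give the final slot. The slot is
// verified against the key because the table stores no extra hash bits.
struct Phf<'a> {
    seed: u64,
    disps: Vec<(u32, u32)>,       // per-bucket displacements (the metadata)
    entries: Vec<(&'a str, u32)>, // name -> codepoint, laid out by the PHF
}

fn split_hash(h: u64) -> (u32, u32, u32) {
    // Arbitrary way of carving three values out of one 64-bit hash.
    ((h >> 42) as u32, ((h >> 21) & 0x1f_ffff) as u32, (h & 0x1f_ffff) as u32)
}

impl<'a> Phf<'a> {
    fn lookup(&self, name: &str) -> Option<u32> {
        let h = fnv1a(self.seed, name.as_bytes()); // any seeded hasher would do
        let (g, f1, f2) = split_hash(h);
        let (d1, d2) = self.disps[g as usize % self.disps.len()];
        let idx = d1.wrapping_add(f1.wrapping_mul(d2)).wrapping_add(f2) as usize
            % self.entries.len();
        let (stored_name, codepoint) = self.entries[idx];
        (stored_name == name).then_some(codepoint)
    }
}
```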

1 Answer


I'm trying to write a (perfect) hash table ...

If you want a perfect hash function, I would generate it with something like CMPH. This may end up being a static lookup table behind the scenes.

Alternatively, you could use a non-hash-based approach, for instance a DAWG or some trie-like structure (perhaps with some Aho-Corasick on top?).

A DAWG would give fairly compact storage and fast string-to-number lookups. My hunch is that it would likely beat a hash table for your problem.

See http://www.wutka.com/dawg.html for an introduction. There are implementations in several languages.
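For a rough illustration of the non-hash route, here is a minimal trie sketch (a plain trie, not a minimized DAWG; a real DAWG would additionally merge identical suffix subtrees, which is where most of the space saving comes from):

```rust
use std::collections::HashMap;

// One node per byte of the name; the codepoint is stored at the terminal node.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>, // with a 38-symbol alphabet, a small array would also work
    value: Option<u32>,
}

impl TrieNode {
    fn insert(&mut self, name: &str, codepoint: u32) {
        let mut node = self;
        for &b in name.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        node.value = Some(codepoint);
    }

    fn get(&self, name: &str) -> Option<u32> {
        let mut node = self;
        for &b in name.as_bytes() {
            node = node.children.get(&b)?;
        }
        node.value
    }
}
```

Building the structure once ahead of time and flattening it into a static node array would give the compact, read-only table this answer has in mind.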

Philippe Ombredanne