I'm looking for a hash function f that maps short strings (say, no longer than 100 characters) to integers in the interval [0, N), where N is typically between 10 and 100, and that distributes values as uniformly as possible across the buckets 0, 1, ..., N-1.

My current solution combines SHA1, CRC32 and several reversible int -> int transformations (for example, as outlined in this answer), followed by a final modulo operation. It works rather well, but I feel there is still room for improvement, since the buckets I get are not always as evenly sized as I had hoped.

Last but not least, let me briefly outline my use case: I use this hash function to split a labeled data set into train, validation and test sets for supervised machine learning, based on the string identifiers of the individual rows. So, given a hash into [0, 10), I define my training data as the rows with hashes {0, 1, 2, 3, 4, 5}, my validation data as the rows with hashes {6, 7}, and my test data as the rows with hashes {8, 9}. I could of course just use a random split, but the hash-based method seems very appealing to me because it's stable, flexible and transparent.
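
A minimal sketch of the split described above, assuming plain SHA1 (via Python's hashlib) followed by a modulo rather than the combined SHA1/CRC32 scheme. The names bucket and assign_split and the "row-..." identifiers are hypothetical and only serve as illustration:

```python
import hashlib


def bucket(identifier: str, n_buckets: int = 10) -> int:
    """Map a string identifier to a bucket in [0, n_buckets)."""
    # SHA1 yields a 160-bit digest; interpret it as one big integer.
    digest = hashlib.sha1(identifier.encode("utf-8")).digest()
    # With 160 bits of input, the modulo bias for small n_buckets is negligible.
    return int.from_bytes(digest, "big") % n_buckets


def assign_split(identifier: str) -> str:
    """Assign a row to a split based on its bucket in [0, 10)."""
    b = bucket(identifier, 10)
    if b <= 5:          # buckets {0, 1, 2, 3, 4, 5}
        return "train"
    if b <= 7:          # buckets {6, 7}
        return "validation"
    return "test"       # buckets {8, 9}


# Count how many of 1000 made-up identifiers land in each split.
counts = {"train": 0, "validation": 0, "test": 0}
for i in range(1000):
    counts[assign_split(f"row-{i}")] += 1
print(counts)
```

Because the assignment depends only on the identifier, a row keeps its split when the data set grows or is reordered.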

To sum things up: What kind of hash function would you suggest with the aforementioned properties, for the use case I've just described?

pjs
Matthias Langer
  • If these short strings are uniformly distributed across their domain, then nothing more than converting them to integers and taking the result mod N is needed. Most likely they are not. In that case, computing the SHA1 hash and then taking that mod N is more than adequate and almost certainly overkill. I don't see any use for CRC-32. Many hash applications need protection from hash flooding attacks, though this doesn't seem to be one of those. If it is, consider siphash. In fact, I'd use siphash for everything. If you don't need hash flooding protection, then just use a fixed key. – President James K. Polk Sep 15 '21 at 15:50
  • The strings I'm hashing are indeed not uniformly distributed. My first instinct was also to go for SHA1 followed by a simple string hash and taking the mod, but this can lead to fairly uneven distributions, at least with small datasets, as demonstrated in this notebook: https://colab.research.google.com/drive/1XMRRwHwbbH2J5XZ56HHnX3_LjCAlS5Vk?authuser=1#scrollTo=uBw6Eo2-Yndb&line=1&uniqifier=1 – Matthias Langer Sep 16 '21 at 08:03

0 Answers