I have built a distributed caching system and need to map a set of integer keys, ranging in value from 0 to approximately 8 million and not uniformly distributed, onto a much smaller number of buckets (fewer than 100).
Right now I'm using a simple modulo operation to distribute the keys:
def hash_fn(key: int, partitions: int) -> int:
    return (key % partitions) + 1
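For reference, here is roughly how I tally the bucket counts (with synthetic random keys standing in for my real, non-uniform key set):

```python
import random
from collections import Counter

def hash_fn(key: int, partitions: int) -> int:
    # Same hash function as above
    return (key % partitions) + 1

# Synthetic stand-in for my real key set
random.seed(0)
keys = [random.randrange(8_000_000) for _ in range(100_000)]

# Count how many keys land in each of 40 buckets
counts = Counter(hash_fn(k, 40) for k in keys)
for bucket, count in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"BUCKET {bucket}: {count}")
```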
This works relatively well, but there is still a bit of skew. Below are the results of mapping the keys into 40 buckets; the difference between the largest and the smallest bucket is 430!
Thus far I've tried MD5, SHA-256, Fibonacci hashing, Knuth's multiplicative method, and a few other multiplicative techniques, to no avail. None of these methods appears to distribute the keys any better than the simple modulo technique.
Do I simply need to live with a slight amount of skew in the distribution or can I do any better?
BUCKET 34: 27963
BUCKET 40: 28062
BUCKET 26: 28086
BUCKET 31: 28095
BUCKET 39: 28096
BUCKET 30: 28100
BUCKET 27: 28101
BUCKET 36: 28123
BUCKET 25: 28128
BUCKET 24: 28131
BUCKET 9: 28133
BUCKET 35: 28150
BUCKET 37: 28151
BUCKET 33: 28156
BUCKET 10: 28157
BUCKET 29: 28159
BUCKET 28: 28169
BUCKET 23: 28174
BUCKET 6: 28175
BUCKET 32: 28180
BUCKET 18: 28186
BUCKET 4: 28191
BUCKET 1: 28194
BUCKET 21: 28195
BUCKET 0: 28210
BUCKET 8: 28221
BUCKET 22: 28230
BUCKET 38: 28233
BUCKET 17: 28236
BUCKET 12: 28240
BUCKET 20: 28246
BUCKET 19: 28251
BUCKET 16: 28251
BUCKET 2: 28260
BUCKET 3: 28261
BUCKET 5: 28272
BUCKET 14: 28289
BUCKET 15: 28290
BUCKET 7: 28312
BUCKET 11: 28368
BUCKET 13: 28393
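For completeness, the Fibonacci-hashing variant I tried looks roughly like this (a 64-bit multiplicative mix with the golden-ratio constant, then scaling the mixed value down into the bucket range; my exact code may differ slightly):

```python
# floor(2**64 / golden_ratio) -- the usual 64-bit Fibonacci hashing constant
GOLDEN64 = 11400714819323198485

def fib_hash(key: int, partitions: int) -> int:
    # Multiplicative mix, truncated to 64 bits
    mixed = (key * GOLDEN64) & 0xFFFFFFFFFFFFFFFF
    # Scale the 64-bit value into [0, partitions)
    return (mixed * partitions) >> 64

# Usage: bucket index for a single key across 40 partitions
bucket = fib_hash(123456, 40)
```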