I have a million randomly generated unique IDs.
If I do:
result = int(hash(id + 'some_salt')) % 1000
Then the IDs appear to be evenly distributed over the integers 0 to 999, with each integer having approximately 1000 IDs mapped to it.
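As a sanity check, the uniformity of this direct scheme can be verified on a smaller sample (100,000 IDs here, purely for speed; SHA-256 via hashlib, as mentioned below, stands in for `hash`):

```python
import hashlib

RANGE = 1000
N = 100_000  # smaller sample than the full million, for speed

counts = [0] * RANGE
for i in range(N):
    # Salt the ID directly, hash, and reduce modulo the output range.
    bucket = int(hashlib.sha256((str(i) + 'some_salt').encode()).hexdigest(), 16) % RANGE
    counts[bucket] += 1

print(min(counts), max(counts))  # both should be close to N / RANGE = 100
```

The minimum and maximum bucket counts stay close to the expected 100 per bucket, as you would expect from a hash that behaves like a uniform random function.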
If I instead first reduce the ID to an intermediate value, then append the salt to that value and hash again:
x = int(hash(id)) % 1000
result = int(hash(str(x) + 'some_salt')) % 1000
Then the resulting distribution is highly non-uniform. Each result is of course in the range [0, 999], but some integers in that range have zero IDs mapped to them, while others have several thousand.
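The collapse can be seen directly: the second hash only ever sees the 1000 possible intermediate values x, so at most 1000 distinct outputs can occur, and any collisions among those 1000 leave some buckets with zero IDs. This sketch (assuming the same 'some_salt' string) counts how many output buckets are actually reachable:

```python
import hashlib

RANGE = 1000

# The second hashing step, applied to every possible intermediate value x.
outputs = [int(hashlib.sha256((str(x) + 'some_salt').encode()).hexdigest(), 16) % RANGE
           for x in range(RANGE)]

distinct = len(set(outputs))
empty = RANGE - distinct
print(distinct, empty)  # fewer than 1000 buckets are reachable; the rest get zero IDs
```

Every unreachable bucket ends up with zero IDs, while each reachable bucket receives all of the roughly 1000 IDs behind each intermediate value that maps to it.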
Why does this result in a very non-uniform distribution of values?
How can I adjust this to result in a uniform distribution of integers in the range [0,999] for my million IDs, and any given salt? I want to keep the intermediate step of reducing the potentially very large input space to some much smaller space (e.g. of size 1000).
I'm using SHA-256 hashing.
Here is some Python code which demonstrates the very non-uniform results:
import hashlib

import numpy as np

INTERMEDIATE_RANGE_SIZE = 1000
OUTPUT_RANGE_SIZE = 1000

unique_ids = range(1000000)  # sequential here, but could be any kind of unique IDs
frequencies = np.zeros(OUTPUT_RANGE_SIZE, dtype=int)

for uid in unique_ids:
    # Step 1: reduce the (potentially very large) ID space to a small intermediate range.
    hash_mod = int(hashlib.sha256(str(uid).encode()).hexdigest(), 16) % INTERMEDIATE_RANGE_SIZE
    # Step 2: append the salt to the intermediate value and hash again.
    result = int(hashlib.sha256((str(hash_mod) + 'some_salt').encode()).hexdigest(), 16) % OUTPUT_RANGE_SIZE
    frequencies[result] += 1

print(frequencies)