
I have a million randomly generated unique IDs.

If I do:

result = int(hash(id + 'some_salt')) % 1000

Then the IDs seem to be evenly distributed across the integers 0 to 999, with each integer having approximately 1000 IDs mapped to it.

If I now append some salt to this and take the hash again:

x = int(hash(id)) % 1000
result = int(hash(str(x) + 'some_salt')) % 1000

Then the resulting distribution is completely non-uniform. For each ID the result is, of course, in the range [0, 999], but some integers in this range have zero IDs mapped to them, while others have several thousand.

Why does this result in a very non-uniform distribution of values?

How can I adjust this to result in a uniform distribution of integers in the range [0,999] for my million IDs, and any given salt? I want to keep the intermediate step of reducing the potentially very large input space to some much smaller space (e.g. of size 1000).

I'm using SHA-256 hashing.

Here is some Python code which demonstrates the very non-uniform results:

import numpy as np
import hashlib

OUTPUT_RANGE_SIZE = 1000

unique_ids = range(1000000)  # sequential here, but could be any kind of unique IDs
frequencies = np.zeros(OUTPUT_RANGE_SIZE, dtype='int')

for id in unique_ids:
    hash_mod = int(hashlib.sha256(str(id).encode()).hexdigest(), 16) % 1000
    result = int(hashlib.sha256((str(hash_mod) + 'some_salt').encode()).hexdigest(), 16) % OUTPUT_RANGE_SIZE
    frequencies[result] += 1

print(frequencies)
Josh
  • Are you _sure_ you're using SHA-256? 'hash' in many languages (such as Python) is an internal function that's not secure, and in fact not even stable across instances, and would likely behave as you describe. SHA-256 creates a byte array or encoded string, and I can't think of any language that would allow you to coerce it to an int like you have above. – Nick Johnson Mar 30 '15 at 10:30
  • Well, I am using the hashlib library for Python and converting the hex representation to an int... for example like this: int(hashlib.sha256(id + 'some_string').hexdigest(), 16) % 1000 ... If you want to see my code for this, I have pasted it here: http://pastebin.com/sMP4G2vQ - uncommenting the print line will show the very non-uniform results – Josh Mar 30 '15 at 10:41
  • You should edit your question to use that actual code - pastebins tend not to stick around. In any case, I ran your code with both randomly selected and sequential IDs, and in either case the results are well distributed, as one would expect. – Nick Johnson Mar 30 '15 at 10:51
  • Sorry! I pasted you the wrong code. I have just updated the question with the code I meant to paste... – Josh Mar 30 '15 at 11:03

1 Answer


By applying the modulo operator to your first hash, you've ensured that there are only 1000 distinct outputs from that stage, regardless of how many unique inputs you had. When you hash and take the modulo again, by chance some of those 1000 intermediate values will collide in the same output bucket, while other buckets receive no intermediate value at all. Since each intermediate value carries roughly 1000 of your IDs, each output bucket's count ends up being roughly 1000 times the number of intermediate values that hashed to that bucket. You can see this by dividing the values in your frequencies array by 1000:

[1, 0, 2, 1, 0, 0, 0, ...]
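You can make that collision count explicit by enumerating the 1000 possible intermediate values directly (a sketch; the variable names are mine, not from the question):

```python
import hashlib

# Only 1000 distinct intermediate values exist after the first modulo.
# Count how many of them hash into each output bucket in the second step.
collisions = [0] * 1000
for x in range(1000):
    bucket = int(hashlib.sha256((str(x) + 'some_salt').encode()).hexdigest(), 16) % 1000
    collisions[bucket] += 1

# With a million evenly spread IDs, each intermediate value carries ~1000 IDs,
# so the observed frequencies array is approximately collisions scaled by 1000.
print(collisions[:10])
```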

If you remove the modulo operator from the first step, your output values in the second step will be evenly distributed as expected.
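A sketch of that fix (my variable names; the bounds in the final check are informal, just a sanity check on uniformity):

```python
import hashlib

OUTPUT_RANGE_SIZE = 1000
frequencies = [0] * OUTPUT_RANGE_SIZE

for id in range(1000000):
    # Keep the full 256-bit intermediate value -- no modulo at this stage.
    intermediate = int(hashlib.sha256(str(id).encode()).hexdigest(), 16)
    result = int(hashlib.sha256((str(intermediate) + 'some_salt').encode()).hexdigest(), 16) % OUTPUT_RANGE_SIZE
    frequencies[result] += 1

# Each bucket should now hold close to the expected 1000 IDs.
print(min(frequencies), max(frequencies))
```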

Obligatory postscript: Don't invent your own cryptosystems. If this is security critical, learn about the best practices and implement them.

Nick Johnson
  • Thanks, that makes sense. Do you think there's any way to reduce the inputs to 1000 possibilities and still achieve evenly distributed results for my million IDs and any given salt? This isn't actually related to security - I just want a random but deterministic (using a salt) way of mapping the IDs to an int in [0,999], with an intermediate step that reduces the range of possible inputs (e.g. to 1000 possibilities) – Josh Mar 30 '15 at 11:46
  • (The intermediate step should take place before using the salt) – Josh Mar 30 '15 at 12:00
  • @Josh Why does the intermediate step need to reduce the number of possibilities? If your intermediate step doesn't do the modulo, you get good quality results. – Nick Johnson Mar 30 '15 at 12:33
  • 1
    To answer your question under those constraints, though: What you need is an intermediate mixing step that is 1:1. For an example, see this blog post: http://blog.notdot.net/2007/9/Damn-Cool-Algorithms-Part-2-Secure-permutations-with-block-ciphers . Alternately, make your intermediate range much larger than your output range, so there's less unevenness in the mapping. – Nick Johnson Mar 30 '15 at 12:36
  • well the reason I want the intermediate step is for caching purposes. I am building a client-server architecture and I want to reduce the number of inputs on the client, to facilitate caching. The rest of the computation (including adding the salt) will take place separately on the server. – Josh Mar 30 '15 at 12:38
  • Thanks for the link. Perhaps I can achieve what I want like this: (1) do hash(id) mod 1000 on the client to get an index. (2) create a list containing every int in [0,999]. (3) shuffle the list using the salt as a seed. (4) access the shuffled list using the index from (1) to get the result. Then only step (1) needs to be on the client, and it should result in the uniform distribution I want? – Josh Mar 30 '15 at 12:47
  • @Josh Yes, that will work - I'm so used to thinking of very large ranges that the obvious solution didn't occur to me for generating a permutation: Just generate one. – Nick Johnson Mar 30 '15 at 13:32
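Josh's four-step approach from the comments can be sketched like this (a sketch only; the function names and the use of `random.Random(salt)` as the deterministic shuffle seed are my choices, not from the thread):

```python
import hashlib
import random

OUTPUT_RANGE_SIZE = 1000

def client_index(id):
    # Step 1 (client side): reduce the ID to one of 1000 cacheable indices.
    return int(hashlib.sha256(str(id).encode()).hexdigest(), 16) % OUTPUT_RANGE_SIZE

def salted_bucket(index, salt):
    # Steps 2-4 (server side): build the list of every int in [0, 999],
    # shuffle it deterministically using the salt as the seed, and look up
    # the client's index in the shuffled list.
    permutation = list(range(OUTPUT_RANGE_SIZE))
    random.Random(salt).shuffle(permutation)
    return permutation[index]

# The index -> bucket mapping is a bijection, so the uniform distribution
# of client indices carries over unchanged to the final buckets.
```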