
Assume a hacker obtains a data set of stored hashes, the salts, the pepper, and the algorithm, and has access to unlimited computing resources. I wish to determine a maximum hash size such that the certainty of recovering the original input string is nominally equal to some target certainty percentage.

Constraints:

The input string is limited to exactly 8 numeric characters uniformly distributed. There is no inter-digit relation such as a checksum digit.

The target nominal certainty percentage is 1%.

Assume the hashing function is uniform.

What is the maximum hash size in bytes such that there are nominally 100 (i.e. 1% certainty) 8-digit values that compute to the same hash? It should be possible to generalize the accepted answer to N numeric digits and X%.
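For concreteness, here is how the arithmetic seems to generalize; a minimal Python sketch (the function name and the choice to round down are mine, not part of any accepted answer):

```python
import math

def max_hash_bits(n_digits: int, certainty_pct: float) -> int:
    """Largest hash size (in bits) such that each hash value has
    nominally at least 100/certainty_pct preimages among the
    10**n_digits possible inputs, assuming a uniform hash."""
    candidates_per_hash = 100.0 / certainty_pct   # 1% -> 100 candidates
    buckets = 10 ** n_digits / candidates_per_hash
    return math.floor(math.log2(buckets))         # round down: more collisions

bits = max_hash_bits(8, 1.0)   # 19 bits, since log2(10**6) ~= 19.93
print(bits, bits // 8)         # 19 bits -> at most 2 whole bytes
```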

Please include whether there are any issues with using the first N bytes of the standard 20 byte SHA1 as an acceptable implementation.

It is recognized that this approach will greatly increase susceptibility to a brute-force attack by increasing the number of possible "correct" answers, so there is a design trade-off, and some additional measures may be required (time delays, multiple validation stages, etc.).

crokusek
  • I'm not sure I understand what you want. Are you assuming that the hashed values can be cracked, and want to diminish their value by ensuring collisions, so that at best they will only know that 1 of 100 possible inputs hashed to a particular value? – hatchet - done with SOverflow Aug 07 '13 at 20:32
  • Nearly perfect summary!! Except "at best" -> "nominally" because I was thinking "at best" was impossible (prove me wrong is okay). – crokusek Aug 07 '13 at 22:56
  • I have done something like this, and can post an answer, but if you go with this idea, understand that you will very likely get collisions within the data you handle, whereas with a normal hash that would be rare. – hatchet - done with SOverflow Aug 07 '13 at 23:07
  • Yes, that sounds correct. – crokusek Aug 08 '13 at 00:04

1 Answer


It appears you want to ensure collisions, with the idea that if a hacker obtained everything, and it is assumed they can brute-force all the hashed values, then they will not end up with the original values, but only with a set of possible original values for each hashed value.

You could achieve this by executing a precursor step before your normal cryptographic hashing. This precursor step simply folds your set of possible values onto a smaller set of possible values. This can be accomplished by a variety of means; essentially, you are applying an initial hash function over your input values. The modulo arithmetic described below is one simple kind of hash function, but other types could be used.

If you have 8-digit original strings, there are 100,000,000 possible values: 00000000 - 99999999. To ensure that 100 original values hash to the same thing, you just need to map them into a space of 1,000,000 values. The simplest way to do that would be to convert your strings to integers, perform a modulo 1,000,000 operation, and convert back to a string. Having done that, the following values would all hash to the same bucket: 00000000, 01000000, 02000000, ....

The problem with that is that the hacker would not only know which 100 values a hashed value could have come from, but would also know with certainty what 6 of the 8 digits are. If the real-life variability of digits in the actual values being hashed is not uniform over all positions, the hacker could use that to get around what you're trying to do.
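A minimal sketch of this plain modulo fold (the helper name is mine), which also makes the digit leak concrete:

```python
def fold(value: str) -> str:
    """Pre-hash: fold the 100,000,000-value space down to 1,000,000 buckets."""
    return "%06d" % (int(value) % 1_000_000)

# 100 distinct inputs land in the same bucket...
preimages = ["%08d" % (345678 + k * 1_000_000) for k in range(100)]
assert all(fold(p) == "345678" for p in preimages)

# ...but they all share the same last 6 digits, which is exactly the
# leak described above: only the first 2 digits remain unknown.
print(preimages[:3])   # ['00345678', '01345678', '02345678']
```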

Because of that, it would be better to choose your modulo value such that the full range of digits is represented fairly evenly at every character position within the set of values that map to the same hashed value.

If different regions of the original string have more variability than other regions, then you would want to adjust for that, since the static regions are easier to just guess anyway. The part the hacker would want is the highly variable part they can't guess. By breaking the 8 digits into regions, you can perform this pre-hash separately on each region, with your modulo values chosen to vary the degree of collisions per region.

As an example, you could break the 8 digits up thus: 000-000-00. The pre-hash would convert each region into a separate value, perform a modulo on each, concatenate them back into an 8-digit string, and then do the normal hashing on that. In this example, given the input of "12345678", you would do 123 % 139, 456 % 149, and 78 % 47, which produces 123, 009, and 31. There are 139 * 149 * 47 = 973,417 possible results from this pre-hash, so roughly 103 original values will map to each output value. To give an idea of how this ends up working, the following 3-digit original values in the first region would all map to the same value of 000: 000, 139, 278, 417, 556, 695, 834, 973. I made this up on the fly as an example, so I'm not specifically recommending these choices of regions and modulo values.
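A sketch of that example in Python (the function name is mine; the regions and moduli are just the made-up ones above):

```python
def region_prehash(value: str) -> str:
    """Fold the 000-000-00 regions separately, then re-concatenate."""
    a, b, c = int(value[:3]), int(value[3:6]), int(value[6:])
    return "%03d%03d%02d" % (a % 139, b % 149, c % 47)

print(region_prehash("12345678"))   # '12300931' (i.e. 123, 009, 31)
# 139 * 149 * 47 = 973,417 buckets, so ~103 originals per bucket.
```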

If the hacker got everything, including source code, and brute-forced it all, he would end up with the values produced by the pre-hash. So for any particular hashed value, he would know that it is one of around 100 possible values. He would know all of those possible values, but he wouldn't know which of them was THE original value that produced the hashed value.

You should think hard before going this route. I'm wary of anything that departs from standard, accepted cryptographic recommendations.

  • Great answer, the extra space transformation makes sense and its domain size is the key I was missing. I believe the pre-hash stage can actually be skipped and that the size of the final hash should be determined the way you say for the pre-hash, except we're mapping to bits instead of base 10. So 100,000,000 values need to map (scrunch) into 1,000,000 final values. The bits required are log(1,000,000) / log(2) = 19.9315, rounding down (more scrunching) to 19 bits. So I would apply a mask of 0x7FFFF to the 20-byte SHA1 result (or just modulus it by 1,000,000). Does this jump look valid? – crokusek Aug 08 '13 at 05:07
  • Probably better to do a double hash as you said, but why not just do the prehash as a masked (or modulus) SHA1 (per previous comment)? Then convert to hex string and apply salt and pepper and hash again for the stored value. This keeps the final stored value using the full SHA1 20 byte space. – crokusek Aug 08 '13 at 06:10
  • When I did something similar to this, I needed to have complete control over how the original values were mapped to the smaller space, for reasons having to do with the nature of the original data values I was working with. Before going your route, since you're only dealing with millions of possible values, I would run the algorithm on the whole set of values and compute some metrics to ensure the original values actually distribute well over the reduced space. If only 5 values map to one final hash and 200 map to another, that's not ideal. With the pre-hash you have total control over that. – hatchet - done with SOverflow Aug 08 '13 at 07:20
  • Yes, when double hashing, for the pre-hash I'm suggesting to use a modulus or mask of a standard hash such as SHA1, because each bit should offer great uniformity, a property of good hashes even for non-uniformly distributed input data. In this manner no custom pre-hash is needed. But verifying this is great advice nonetheless (see the sketch after these comments). – crokusek Aug 08 '13 at 18:59
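Putting the last two comments together, a sketch of the masked-SHA1 pre-hash along with the distribution check hatchet suggests (the 19-bit mask comes from the log2 arithmetic in the earlier comment; the function name is mine):

```python
import hashlib
from collections import Counter

def sha1_prehash(value: str) -> int:
    """Pre-hash: SHA1 the input, keep only the low 19 bits (mask 0x7FFFF),
    giving 2**19 = 524,288 possible buckets (~190 originals per bucket)."""
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest, "big") & 0x7FFFF

# Sanity check over the whole 8-digit space (slow: 10**8 SHA1 calls,
# so expect this to run for several minutes in pure Python).
counts = Counter(sha1_prehash("%08d" % v) for v in range(100_000_000))
sizes = counts.values()
print(min(sizes), max(sizes), sum(sizes) / len(counts))  # want ~190 each
```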