If I have some data that I hash with SHA-256 like this: hash = SHA256(data)

And then keep only the first 8 bytes of the hash instead of the whole 32 bytes, how easy is it to find a hash collision with different data? Is it 2^64 or 2^32?

If I need to reduce a hash of some data to a smaller size (n bits), is there any way to ensure the search space is 2^n?

Harry
  • [Collision Resistance](http://en.wikipedia.org/wiki/Collision_resistance) – President James K. Polk Feb 23 '15 at 11:46
  • [cross-posted on crypto](http://crypto.stackexchange.com/questions/24072/reducing-size-of-hash-function) – CodesInChaos Feb 23 '15 at 19:26
  • I'm voting to close this question as off-topic because it seems to be about cryptanalysis and doesn't include a programming question. I also note you've cross-posted it to Crypto, which is perhaps a more sensible place for it to live. – Duncan Jones Feb 24 '15 at 15:12

1 Answer

I think you're actually interested in three things.

The first thing you need to understand is the entropy distribution of the hash. If the output of a hash function is n bits long, then the maximum entropy is n bits. Note that I say maximum; you are never guaranteed to have n bits of entropy. Similarly, if you truncate the hash output to n/4 bits, you are not guaranteed to have n/4 bits of entropy in the result. SHA-256 is fairly uniformly distributed, which means in part that you are unlikely to have more entropy in the high bits than the low bits (or vice versa).
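For reference, the truncation being discussed can be sketched in a few lines of Python. This is only an illustration using the standard `hashlib` module; the helper name `truncated_sha256` and the sample input are my own, not something from the question.

```python
import hashlib


def truncated_sha256(data: bytes, n_bytes: int = 8) -> bytes:
    """Return the first n_bytes of the SHA-256 digest of data."""
    if not 1 <= n_bytes <= 32:
        raise ValueError("n_bytes must be between 1 and 32")
    return hashlib.sha256(data).digest()[:n_bytes]


# Hypothetical sample input; 8 bytes gives a 64-bit value.
short = truncated_sha256(b"abc")
print(len(short) * 8, "bits:", short.hex())  # 64 bits: ba7816bf8f01cfea
```

Because the output is fairly uniform, taking the first 8 bytes should be no better or worse than taking any other 8 bytes of the digest.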

However, information on this is sparse because the hash function is intended to be used with its whole hash output. If you only need an 8-byte hash output, then you might not even need a cryptographic hash function and could consider other algorithms. (The point is, if you need a cryptographic hash function, then you need as many bits as it can give you, as shortening the output weakens the security of the function.)

The second is the search space: it is not dependent on the hash function at all. Searching for an input that creates a given output on a hash function is more commonly known as a brute-force attack. The number of inputs that will have to be searched does not depend on the hash function itself; how could it? Every hash function output is the same size: every SHA-256 output is 256 bits. If you just need a collision, you could find one specific input that generates each possible output of 256 bits. Unfortunately, this would take up a minimum storage space of 256 * 2^256 ≈ 3 * 10^79 bits for just the hash values themselves (i.e. not counting the inputs needed to generate them), which vastly eclipses the entire hard drive capacity of the entire world.
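The storage figure is easy to check with Python's arbitrary-precision integers. This is just the arithmetic from the paragraph above; the function name `table_bits` is made up for the example.

```python
def table_bits(output_bits: int) -> int:
    """Bits needed to store one hash value for every possible output."""
    return output_bits * 2 ** output_bits


full = table_bits(256)   # full SHA-256 output
short = table_bits(64)   # output truncated to 8 bytes

print(f"256-bit table: {full:.2e} bits")
print(f" 64-bit table: {short:.2e} bits ({short // 8:.2e} bytes)")
```

The same table for a 64-bit truncated hash is still on the order of 10^20 bytes, so nobody is storing that either, but it is dozens of orders of magnitude closer to feasible.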

Therefore, the search space depends on the complexity and length of the input to the hash function. If your data is 8-character ASCII strings, then you're pretty well guaranteed never to have a collision, BUT the search space for those hash values is only 2^(7*8) = 2^56 ≈ 7.2 * 10^16, which is small enough to be searched exhaustively with enough hardware and time. After all, you don't need to find a collision if you can find the original input itself. This is why salts are important in cryptography.
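To get a feel for that input space, here is a rough sketch of the arithmetic. The hash rate is purely an assumed figure for illustration (real rates vary enormously between a CPU and dedicated hardware), and both function names are hypothetical.

```python
def input_space(n_chars: int, bits_per_char: int = 7) -> int:
    """Number of distinct strings of n_chars characters (7-bit ASCII by default)."""
    return 2 ** (bits_per_char * n_chars)


def days_to_exhaust(space: int, hashes_per_second: float) -> float:
    """Days needed to hash every candidate at the given rate."""
    return space / hashes_per_second / 86_400


ASSUMED_RATE = 1e9  # hypothetical: one billion hashes per second

space = input_space(8)
print(space)  # 72057594037927936, i.e. 2^56
print(f"{days_to_exhaust(space, ASSUMED_RATE):.0f} days at the assumed rate")
```

Plug in whatever rate matches your attacker; the point is that the cost scales with the input space, not with the 256-bit output size.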

Third, you're interested in knowing the collision resistance. As the Collision Resistance article linked in the comments points out, collision resistance is much more limited than the input search space, because of the pigeonhole principle.

Every hash function with more inputs than outputs will necessarily have collisions. Consider a hash function such as SHA-256 that produces 256 bits of output from an arbitrarily large input. Since it must generate one of 2^256 outputs for each member of a much larger set of inputs, the pigeonhole principle guarantees that some inputs will hash to the same output. Collision resistance doesn't mean that no collisions exist; simply that they are hard to find.
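The pigeonhole argument is easiest to see on a deliberately tiny output. In this sketch I keep only the first byte of SHA-256 (256 possible outputs), so any 257 distinct inputs must contain a collision. The inputs are just decimal counters, chosen arbitrarily for the demonstration.

```python
import hashlib


def first_byte(data: bytes) -> int:
    """SHA-256 truncated to its first byte: only 256 possible outputs."""
    return hashlib.sha256(data).digest()[0]


def pigeonhole_collision():
    """Hash up to 257 distinct inputs; a repeat is guaranteed by then."""
    seen = {}
    for i in range(257):
        msg = str(i).encode()
        h = first_byte(msg)
        if h in seen:
            return seen[h], msg, h
        seen[h] = msg
    raise AssertionError("unreachable: 257 inputs cannot fit in 256 outputs")


a, b, h = pigeonhole_collision()
print(a, b, "both map to", h)
```

In practice the loop stops long before input 257, which is the birthday effect described next.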

The "birthday paradox" places an upper bound on collision resistance: if a hash function produces N bits of output, an attacker who computes "only" 2^(N/2) (or sqrt(2^N)) hash operations on random input is likely to find two matching outputs. If there is an easier method than this brute force attack, it is typically considered a flaw in the hash function.
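You can watch the birthday bound in action by truncating much harder than 8 bytes. The sketch below keeps 3 bytes (24 bits), where a collision should typically turn up after a few thousand hashes (around 2^12) rather than 2^24. The function names and the counter-based inputs are assumptions for the demo, not a real attack tool.

```python
import hashlib
from itertools import count


def truncated(data: bytes, n_bytes: int) -> bytes:
    """First n_bytes of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[:n_bytes]


def find_collision(n_bytes: int):
    """Hash successive counters until two share a truncated digest.

    Returns (first_input, second_input, number_of_hashes_computed).
    Memory grows with the number of hashes, so keep n_bytes small.
    """
    seen = {}
    for i in count():
        msg = i.to_bytes(8, "big")
        h = truncated(msg, n_bytes)
        if h in seen:
            return seen[h], msg, i + 1
        seen[h] = msg


a, b, tries = find_collision(3)
print(f"collision after {tries} hashes: {a.hex()} vs {b.hex()}")
```

The same loop with `n_bytes=8` would need on the order of 2^32 hashes and a correspondingly large table; real attacks use memory-efficient cycle-finding techniques instead of a dictionary, but the work factor is the same.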

So consider what happens when you examine and store only the first 8 bytes (one fourth) of your output. Your collision resistance has dropped from 2^(256/2) = 2^128 to 2^(64/2) = 2^32. How much smaller is 2^32 than 2^128? It's a whole lot smaller, as it turns out: a factor of 2^96, which makes it roughly 1.3 * 10^-27 % of the size at best.
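If you want numbers for your own truncation length, the usual birthday approximation p ≈ 1 - exp(-k(k-1) / 2^(N+1)) gives the chance of at least one collision among k random N-bit hashes. The sketch below just evaluates that approximation; it assumes the truncated output behaves like a uniformly random value, and `collision_probability` is a name I made up.

```python
import math


def collision_probability(num_hashes: int, bits: int) -> float:
    """Approximate chance of at least one collision among num_hashes
    uniformly random values of the given bit length."""
    exponent = num_hashes * (num_hashes - 1) / 2 ** (bits + 1)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), accurate for tiny values


print(collision_probability(2 ** 32, 64))    # 8-byte truncation
print(collision_probability(2 ** 32, 256))   # full SHA-256
print(2 ** 32 / 2 ** 128)                    # ratio of the two work factors
```

At 2^32 hashes, the 64-bit case comes out at a bit under 40%, while the 256-bit case is indistinguishable from zero.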

Patrick M