20

MD5 and SHA-1 hashes have known weaknesses against collision attacks. SHA-256 does not, but it outputs 256 bits. Can I safely take the first or last 128 bits and use that as the hash? I know it will be weaker (because it has fewer bits), but otherwise will it work?

Basically I want to use this to uniquely identify files in a file system that might one day contain a trillion files. I'm aware of the birthday problem, and a 128-bit hash should yield about a 1 in a trillion chance, on a trillion files, that two different files would have the same hash. I can live with those odds.
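(For reference, here's how I sanity-checked that estimate, using the standard birthday approximation p ≈ n(n−1)/2·space — the variable names are just illustrative:)

```python
# Birthday-bound sanity check for a 128-bit hash over a trillion files.
n = 1e12                       # number of files
space = 2.0 ** 128             # size of a 128-bit hash space
p = n * (n - 1) / (2 * space)  # standard birthday approximation
print(p)                       # on the order of 1e-15
```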

What I can't live with is if somebody could easily and deliberately insert a new file with the same hash and the same beginning characters as an existing file. I believe this is possible with MD5 and SHA-1.

Sunny Hirai
  • I had thought the birthday paradox would give lower odds than that, but Wikipedia agrees with you: http://en.wikipedia.org/wiki/Birthday_paradox#Probability_table – Mark Ransom Jun 11 '10 at 23:03
  • Related question: http://stackoverflow.com/questions/2256423/truncating-an-md5-hash-how-do-i-calculate-the-odds-of-a-collision-occuring – Shadok Feb 01 '12 at 14:53
  • See also: http://security.stackexchange.com/questions/18385/does-truncating-the-cryptographic-hash-make-it-impossible-to-crack – Luc Apr 10 '13 at 14:37
  • So... without making another vague reference to the birthday paradox Wikipedia article, can somebody sum up in brief, relatively non-technical language why it's okay to truncate the output from the hash algorithm? If that's such a good idea, why doesn't the hash algorithm just save you the trouble and truncate itself? Put another way, the hash algorithm produces an output that is guaranteed, en total, within the parameters of the algorithm to be unique for every input. Does the actual algorithm _itself_ guarantee that the first 128 characters will be unique? – Craig Tullis Apr 15 '13 at 07:01
  • Can you really infer that it's valid to truncate the output from SHA-256 from an article about the birthday paradox which discusses hashing in general, but nowhere mentions the effects of truncating the outputs of hashing algorithms, let alone the effects of truncating the outputs of any _specific_ hashing algorithms? SHA-256 produces a 256-bit result, yeah? It does _not_ output a 128-bit result. Where do the authors of the algorithm state that if you arbitrarily discard 128 bits of the result, you're safe? How is truncated SHA-256 safer than full 160-bit SHA-1, for that matter? – Craig Tullis Apr 15 '13 at 07:12

4 Answers

8

Yeah, that will work. Theoretically it's better to XOR the two halves together, but even truncated SHA-256 is stronger than MD5. You should still consider the result a 128-bit hash rather than a 256-bit hash, though.
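In Python terms, both options look like this (a sketch using the standard-library `hashlib` module; the function names are mine):

```python
import hashlib

def truncated_sha256(data: bytes) -> bytes:
    """Keep only the first 128 bits (16 bytes) of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[:16]

def folded_sha256(data: bytes) -> bytes:
    """XOR the two 128-bit halves of the digest together instead."""
    d = hashlib.sha256(data).digest()
    return bytes(a ^ b for a, b in zip(d[:16], d[16:]))
```

Either way the result should be treated as a 128-bit hash.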

My recommendation in this particular case is to store and reference files using HASH + uniquifier, where the uniquifier is the count of how many distinct files you've seen with this hash before. That way you don't fall down flat if somebody tries to store future-discovered collision vectors for SHA-256.
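A minimal in-memory sketch of that scheme (the names and data structures here are illustrative, not from the answer; a real system would persist the counter alongside the file index):

```python
import hashlib

def h128(data: bytes) -> bytes:
    # truncated SHA-256, as discussed above
    return hashlib.sha256(data).digest()[:16]

store = {}   # (hash, uniquifier) -> file contents
counts = {}  # hash -> number of distinct files seen with that hash

def put(data: bytes):
    h = h128(data)
    n = counts.get(h, 0)
    for i in range(n):              # already stored under some uniquifier?
        if store[(h, i)] == data:
            return (h, i)
    store[(h, n)] = data            # new (possibly colliding) content
    counts[h] = n + 1
    return (h, n)
```

The first file with a given hash gets uniquifier 0; a later, different file that happens to collide would get 1, so both remain addressable.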

Joshua
  • I can find no reference that says it is theoretically better to XOR the halves together, and I'm skeptical that it is. Interesting idea with the uniquifier. – President James K. Polk Jun 12 '10 at 13:15
  • GregS: some of the early attacks on MD5 resulted in collisions on most of the hash with one or two cells different. – Joshua Jun 12 '10 at 13:46
  • @Joshua That sounds like it's empirically (not theoretically) better, then. I'm also interested in a reference as to why XOR would be better. – Drux Feb 03 '15 at 19:11
  • You don't need to XOR the two halves, the official standard says you can just take the leftmost 128 bits (see [this answer](https://security.stackexchange.com/a/72685/255807)). – Eric Mutta Apr 18 '21 at 22:05
3

But is it worth it? If you have a hash for each file, then you essentially have an overhead for each file. Let's say that each file must take up at least 512 bytes (a typical disk sector) and that you're storing these hashes compactly enough so as to not have each hash take up much more than the hash size.

So, even if all your files are 512 bytes, the smallest, you're talking either 16 / 512 = 3.1% or 32 / 512 = 6.3%. In reality, I'd bet your average file size is higher (unless all your files are 1 sector...), so that overhead would be less.

Now, the amount of space you need for hashes scales linearly with the number of files you have. Is that extra space worth that much? Even if you had your mentioned trillion files - that's 1 000 000 000 000 * 16 bytes ≈ 15 TiB, which is a lot of space, but keep in mind: your data would be 1 000 000 000 000 * 512 bytes ≈ 465 TiB. The absolute numbers don't really matter, since it's still 3% or 6% overhead. But at this level, where you have a half petabyte of storage, do 15 terabytes matter? At any level, does a 3% savings mean anything? And remember, if your files are larger, you save less. (Which they probably are: good luck getting a 512-byte sector size on a disk that big.)
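The arithmetic is easy to reproduce (assuming 16-byte truncated hashes and worst-case 512-byte files):

```python
files = 10**12              # a trillion files
hash_bytes = 16             # truncated 128-bit hash
file_bytes = 512            # one disk sector per file (worst case)

hash_total = files * hash_bytes
data_total = files * file_bytes
print(hash_total / 2**40)       # ~14.6 TiB spent on hashes
print(hash_total / data_total)  # 0.03125, i.e. ~3.1% overhead
```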

So: is this 3%-or-less disk savings worth the potential security risk? (A question I'll leave unanswered, as it's waaay not my cup of tea.)

Alternatively, could you, say, group files together in some logical fashion, so that you have fewer files? (I mean, if you have trillions of 512-byte files, do you really want to hash every byte on disk?)

Thanatos
  • Doesn't really answer the question. Does it? – ALOToverflow Apr 18 '13 at 14:24
  • @ALOToverflow: No, it doesn't. But that doesn't mean it isn't relevant: sometimes questioning the premise of the question may lead to a better solution for either the poster, the general audience reading the question later via Google, or both: SO is here to be helpful, so I consider such posts worthwhile. I perhaps should have stressed the security aspect harder: in my experience, in most things dealing with cryptography, if you deviate from the beaten path, weird (and usually bad) things tend to happen. Is that worth a slight savings in disk? (It might be, but it depends on use-case.) – Thanatos Apr 18 '13 at 22:17
0

Yes, that will work.

For the record, there are known in-use collision attacks against MD5, but the SHA-1 attacks are at this point completely theoretical (no SHA-1 collision has ever been found... yet).

BlueRaja - Danny Pflughoeft
  • SHA-256 (the hash which the OP is talking about) is SHA-2 though, not SHA-1 - I think? And so far no collisions have been found for SHA-2... not even theoretically. – user353297 Jun 11 '10 at 23:04
  • @blueraja- not totally true. check out: http://people.csail.mit.edu/yiqun/SHA1AttackProceedingVersion.pdf – Yuval Adam Jun 11 '10 at 23:06
  • @mrl33t: No; SHA-1 has theoretical vulnerabilities, but SHA-256 (which is part of the SHA-2 suite) does not even have those. Considering the size of SHA-256 hashes are 2^128 times LARGER than SHA-1, and SHA-2 is thought to be more theoretically secure, it's not likely there'll be any SHA-256 collisions any time soon. – BlueRaja - Danny Pflughoeft Jun 11 '10 at 23:09
  • @Yuval: Yes, that is the theoretical vulnerability I mentioned (actually, there is a more recent paper that reduces the search space even more). Even so, what I said was completely true: there are still no known collisions for SHA-1. – BlueRaja - Danny Pflughoeft Jun 11 '10 at 23:10
  • 2^128 times larger? WOW! ;) I think you might want to check your math, or your wording... – Dan McGrath Jun 11 '10 at 23:10
  • @Dan: whoops, I meant the **search space** is **2^96** times larger, sorry (2^96 * 2^160 = 2^256) – BlueRaja - Danny Pflughoeft Jun 11 '10 at 23:13
  • SHA-1 collision has been found earlier this year: https://shattered.io/ – Palec Aug 26 '17 at 22:35
0

Cryptocurrencies do something similar: for example, Ethereum addresses are the low-order 160 bits of the Keccak-256 (the precursor to SHA-3) hash of the public key.
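A rough illustration of that truncation pattern (note: this uses Python's `hashlib.sha3_256`, which is standardized SHA-3; Ethereum actually uses the pre-NIST Keccak-256, whose padding differs, so the digest below is not a real Ethereum address - the point is only the low-160-bit truncation):

```python
import hashlib

pubkey = bytes(64)                    # placeholder 64-byte public key
digest = hashlib.sha3_256(pubkey).digest()
address = digest[-20:]                # keep the low-order 160 bits
print(address.hex())
```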


zencraft