Collision Attacks, Message Digests and a Possible solution

Question

I've been doing some preliminary research in the area of message digests, specifically collision attacks on cryptographic hash functions such as MD5 and SHA-1, for example the PostScript example and the X.509 certificate duplicate.

From what I can tell, in the case of the PostScript attack, specific data was generated and embedded within the header of the PostScript file (which is ignored during rendering). This brought the internal state of MD5 to a point where the modified wording of the document would lead to a final MD value equal to that of the original PostScript file. The X.509 attack took a similar approach, whereby data was injected within the comment/whitespace sections of the certificate.

Ok so here is my question, and I can't seem to find anyone asking this question:

Why isn't the length of ONLY the data being consumed added as a final block to the MD calculation?
In the case of X.509, why are the whitespace and comments taken into account as part of the MD?

Wouldn't a simple process such as one of the following be enough to resolve the proposed collision attacks:

MD(M + |M|) = xyz
MD(M + |M| + |M| * magicseed_0 +...+ |M| * magicseed_n) = xyz

where :

M : is the message
|M| : size of the message
MD : is the message digest function (eg: md5, sha, whirlpool etc)
xyz : is the pairing of the actual message digest value for the message M and |M|, i.e. <M,|M|>
magicseed_{i}: Is a set of random values generated with seed based on the internal-state prior to the size being added.
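To make the first construction concrete, here is a minimal sketch of MD(M + |M|) using Python's hashlib. The function name `digest_with_length` and the choice to encode |M| as an 8-byte big-endian integer are assumptions made for illustration; the question does not specify an encoding, and the magicseed variant is left out because its generation is not defined precisely enough to implement.

```python
import hashlib
import struct


def digest_with_length(message: bytes) -> str:
    """Return MD5(M || |M|) as hex, with |M| packed as an 8-byte big-endian integer.

    The length encoding is an assumption; any fixed, unambiguous encoding would do.
    """
    length_suffix = struct.pack(">Q", len(message))
    return hashlib.md5(message + length_suffix).hexdigest()


# Compare the plain digest with the length-suffixed digest of the same message.
plain = hashlib.md5(b"hello world").hexdigest()
suffixed = digest_with_length(b"hello world")
print(plain)
print(suffixed)
```

Note that this only changes what is fed to the hash function; the hash function itself is untouched.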

This technique should work, as to date all such collision attacks rely on adding more data to the original message.

In short, the level of difficulty involved in generating a collision message such that:

It not only generates the same MD
But is also comprehensible/parsable/compliant
and is also the same size as the original message,

is immensely difficult if not near impossible. Has this approach ever been discussed? Any links to papers etc would be nice.

Further Question: What is the lower bound for collisions of messages of common length for a hash function H chosen randomly from U, where U is the set of universal hash functions?

Is it 1/N (where N is 2^(|M|)) or is it greater? If it is greater, that implies there is more than one message of length |M| that will map to the same MD value for a given H.

If that is the case, how practical is it to find these other messages? Brute force would be O(2^(|M|)); is there a method with time complexity lower than brute force?

Since it's a research/theoretical question, you might want to migrate it to http://cstheory.stackexchange.com/ — Jeffrey Hantin, Jan 14 '11 at 05:43
Only for the top 5 alternate sites, unfortunately cstheory is not one of them. I have half an answer for you, though. — Jeffrey Hantin, Jan 14 '11 at 05:55

score 0 · Answer 1 · answered Jan 14 '11 at 05:51

Can't speak for the rest of the questions, but the first one is fairly simple: adding length data to the input of MD5 at any stage of the hashing process (1st block, Nth block, final block) just changes the output hash. You couldn't retrieve that length from the output hash string afterwards. It's also entirely conceivable that a collision could be produced from another string of exactly the same length in the first place, so saying "the original string was 17 bytes" is meaningless, because the colliding string could also be 17 bytes.

e.g.

md5("abce(17bytes)fghi") = md5("abdefghi<long sequence of text to produce collision>")

is still possible.
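The point about equal-length collisions can be illustrated with a toy experiment. The sketch below is not an attack on full MD5: it truncates the digest to 2 bytes so that a brute-force search finishes quickly, and it appends the message length before hashing. The names `toy_digest` and `find_equal_length_collision`, the truncation width and the 4-byte message length are all assumptions chosen for illustration.

```python
import hashlib
import itertools
import struct


def toy_digest(message: bytes, digest_bytes: int = 2) -> bytes:
    """MD5 of message || 8-byte length, truncated to digest_bytes (toy only)."""
    suffix = struct.pack(">Q", len(message))
    return hashlib.md5(message + suffix).digest()[:digest_bytes]


def find_equal_length_collision(length: int = 4):
    """Enumerate fixed-length messages until two share the same toy digest."""
    seen = {}
    for counter in itertools.count():
        message = counter.to_bytes(length, "big")
        tag = toy_digest(message)
        if tag in seen:
            # Both messages have the same length and the same truncated digest.
            return seen[tag], message
        seen[tag] = message


first, second = find_equal_length_collision()
print(first.hex(), second.hex(), toy_digest(first).hex())
```

Because there are only 65536 possible 2-byte digests, the search must terminate within 65537 messages, and every candidate has the same length, so the appended length never distinguishes the two colliding inputs.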

score 0 · Answer 2 · edited Apr 13 '17 at 12:32

In the case of X.509 certificates specifically, the "comments" are not comments in the programming language sense: they are simply additional attributes with an OID that indicates they are to be interpreted as comments. The signature on a certificate is defined to be over the DER representation of the entire tbsCertificate ('to be signed' certificate) structure which includes all the additional attributes.

Hash function design is pretty deep theory, though, and might be better served on the Theoretical CS Stack Exchange.

As @Marc points out, though, as long as more bits can be modified than the output of the hash function contains, then by the pigeonhole principle a collision must exist for some pair of inputs. Because cryptographic hash functions are in general designed to behave pseudo-randomly over their inputs, collisions will tend toward being uniformly distributed over possible inputs.
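As a rough illustration of the pigeonhole argument, the sketch below hashes every 2-byte message and groups the results by the first byte of the MD5 digest. The function name `bucket_sizes` and the toy sizes (65536 inputs, at most 256 possible outputs) are assumptions picked so the full input space can be enumerated; real digests are far too large for this.

```python
import hashlib
from collections import Counter


def bucket_sizes(message_bytes: int = 2, digest_bytes: int = 1) -> Counter:
    """Count how many fixed-length messages map to each truncated MD5 value."""
    counts = Counter()
    for value in range(256 ** message_bytes):
        message = value.to_bytes(message_bytes, "big")
        counts[hashlib.md5(message).digest()[:digest_bytes]] += 1
    return counts


counts = bucket_sizes()
# Number of distinct outputs, then the smallest and largest bucket.
print(len(counts), min(counts.values()), max(counts.values()))
```

With more inputs than outputs, at least one output value must be shared by many equal-length inputs, and comparing the smallest and largest bucket gives a feel for how evenly the inputs spread.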

EDIT: Incorporating the message length into the final block of the hash function would be equivalent to appending the length of everything that has gone before to the input message, so there's no real need to modify the hash function to do this itself; rather, specify it as part of the usage in a given context. I can see where this would make some types of collision attacks harder to pull off, since if you change the message length there's a changed field "downstream" of the area modified by the attack. However, this wouldn't necessarily impede the X.509 intermediate CA forgery attack since the length of the tbsCertificate is not modified.

@Dominar: The same construction works for the X.509 case, where attributes more or less containing binary strings of random crap can be incorporated into the tbsCertificate -- unrecognized extensions will be ignored by most processors unless they contain a 'must understand' flag, and that 'must understand' flag is part of the attribute record in the certificate so of course an adversary will not set the flag. — Jeffrey Hantin, Jan 14 '11 at 22:49
@Dominar: Also, the pigeonhole principle thing is just to show that explicitly incorporating the data length into the hash is no panacea. — Jeffrey Hantin, Jan 14 '11 at 22:51
@Dominar: That would be equivalent to appending the message length to the message, wouldn't it? Therefore you wouldn't have to modify the definition of the hash function itself, just its usage. I can see how that would make some collision attacks harder to pull off: if you change the message length there's a change in the input "downstream" of where you're rigging the state of the hash. Such a change wouldn't affect the X.509 attack since the length of the tbsCertificate is unchanged -- see http://www.win.tue.nl/hashclash/rogue-ca/ section 5.3. — Jeffrey Hantin, Jan 17 '11 at 22:20
