19

When calculating a single MD5 checksum on a large file, what technique is generally used to combine the various MD5 values into a single value? Do you just add them together? I'm not really interested in any particular language, library or API which will do this; rather I'm just interested in the technique behind it. Can someone explain how it is done?

Given the following algorithm in pseudo-code:

MD5Digest X
for each file segment F
   MD5Digest Y = CalculateMD5(F)
   Combine(X,Y)

But what exactly would Combine do? Does it add the two MD5 digests together, or what?

channel72
  • Why would you want to do that? – AndiDog Feb 06 '10 at 18:49
  • In order to calculate MD5 values for files which are too large to fit in memory – channel72 Feb 06 '10 at 18:50
  • MD5 only has a 128-bit state tracking a 512-bit file chunk during calculation; who cares how big the file is? – Carl Norum Feb 06 '10 at 18:51
  • @CarlNorum The problem is with interfaces where the hash implementation might keep state, but one is unable to access it. Consider the `digest` function in `pgcrypto`, which works on a single block of data only, without any way to feed in additional data, because the state is hidden behind an individual call. Therefore it's useful to know whether individual hashes can be combined or not. https://www.postgresql.org/docs/9.1/static/pgcrypto.html Some users do that... http://www.postgresql-archive.org/md5-large-object-id-tp5866710p5869128.html – Thorsten Schöning May 11 '17 at 07:40

7 Answers

16

In order to calculate MD5 values for files which are too large to fit in memory

With that in mind, you don't want to "combine" two MD5 hashes. With any MD5 implementation, you have an object that keeps the current checksum state. So you can extract the MD5 checksum at any time, which is very handy when hashing two files that share the same beginning. For big files, you just keep feeding in data - there's no difference whether you hash the file all at once or in blocks, because the state is remembered. In both cases you will get the same hash.
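
A minimal sketch of that idea in Python 3, using only the standard hashlib module (the payload here is made up for illustration): feeding the same bytes in arbitrary pieces produces exactly the same digest as hashing them in one call, because the object carries its state between update calls.

import hashlib

data = b"some large payload" * 1000  # stand-in for a big file's contents

# One-shot hash.
one_shot = hashlib.md5(data).hexdigest()

# Incremental hash: the same state machine, fed piece by piece.
md5 = hashlib.md5()
for i in range(0, len(data), 8192):
    md5.update(data[i:i + 8192])

assert md5.hexdigest() == one_shot  # identical digests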

AndiDog
6

MD5 is an iterative algorithm. You don't need to calculate a ton of small MD5s and then combine them somehow. You just read small chunks of the file and add them to the digest as you're going, so you never have to have the entire file in memory at once. Here's a Java implementation:

import java.io.FileInputStream;
import java.security.MessageDigest;

MessageDigest digest = MessageDigest.getInstance("MD5");
try (FileInputStream f = new FileInputStream("bigFile.txt")) {
    byte[] buffer = new byte[8192];
    int len;
    // Feed the file into the digest one chunk at a time.
    while ((len = f.read(buffer)) != -1) {
        digest.update(buffer, 0, len);
    }
}
byte[] md5hash = digest.digest(); // finalize and read out the 16-byte hash

Et voila. You have the MD5 of an entire file without ever having the whole file in memory at once.

It's worth noting that if for some reason you do want MD5 hashes of subsections of the file as you go along (this is sometimes useful for doing interim checks on a large file being transferred over a low-bandwidth connection), you can get them by cloning the digest object at any time, like so:

byte[] interimHash = ((MessageDigest)digest.clone()).digest();

This does not affect the actual digest object, so you can continue to work with the overall MD5 hash.

It's also worth noting that MD5 is an outdated hash for cryptographic purposes (such as verifying file authenticity from an untrusted source) and should be replaced with something better in most circumstances, such as SHA-1. For non-cryptographic purposes, such as verifying file integrity between two trusted sources, MD5 is still adequate.

Jherico
  • I have a use-case for needing to sum MD5s. I read multiple files in parallel and wish to have a single checksum for the entire collection (assuming files in filename alphabetical order). – Synesso Nov 12 '15 at 04:28
2

The OpenSSL library allows you to add blocks of data to an ongoing hash (SHA-1/MD5), and then when you have finished adding all the data you call the Final method and it will output the final hash.

You don't calculate MD5 on each individual block and then add them together; rather, you add the data to the ongoing hash method from the OpenSSL library. This then gives you an MD5 hash of all the individual data blocks combined, with no limit on the input data size.

http://www.openssl.org/docs/crypto/md5.html#

H. Green
2

This question doesn't make much sense, as the MD5 algorithm takes input of any length. A decent library should have functions so that you don't have to add the entire message at a single time: the message is broken down into blocks and hashed sequentially, with the block that is being processed depending only on the resultant hash from the previous loop.

The pseudocode in the Wikipedia article should give an overview of how the algorithm works.
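
To make that chaining structure concrete, here is a toy sketch in Python. The compress function and initial state are stand-ins invented for illustration - real MD5 has its own compression function, 512-bit blocks, and length padding - but the shape of the loop is the point:

import hashlib

def toy_compress(state, block):
    # Stand-in for MD5's real compression function: mixes the previous
    # 16-byte state with the current block into a new 16-byte state.
    return hashlib.md5(state + block).digest()

def toy_chained_hash(data, block_size=64):
    state = b"\x00" * 16  # stand-in for MD5's fixed initial state
    for i in range(0, len(data), block_size):
        # Each step depends only on the previous state and the current block,
        # which is why implementations can hash arbitrarily long input
        # while holding just one block in memory.
        state = toy_compress(state, data[i:i + block_size])
    return state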

Yacoby
2

A Python 2.7 example illustrating AndiDog's answer. The file 123.txt has multiple lines.

>>> import hashlib
>>> md5_A, md5_B, md5_C = hashlib.md5(), hashlib.md5(), hashlib.md5()
>>> with open('123.txt', 'r') as f_r:
...     md5_A.update(f_r.read()) # read whole contents
... 
>>> with open('123.txt', 'r') as f_r:
...     for line in f_r: # read file line by line
...         md5_B.update(line)
... 
>>> with open('123.txt', 'r') as f_r:
...     while True: # read file chunk by chunk
...         chunk = f_r.read(10)
...         if not chunk: break
...         md5_C.update(chunk)
... 
>>> md5_A.hexdigest()
'5976ddfa19bc2e1669ac3bd836101f58'
>>> md5_B.hexdigest()
'5976ddfa19bc2e1669ac3bd836101f58'
>>> md5_C.hexdigest()
'5976ddfa19bc2e1669ac3bd836101f58'

For a large file that can't fit in memory, it can be read line by line or chunk by chunk. One use of this MD5 is comparing two large files when the diff command fails.

kitt
1

Here is a C# way to feed blocks into one hash. Let's make extension methods to simplify the user code.

public static class MD5Append
{
    // Feed one more block of data into the running hash state.
    public static int Append(this MD5 md5, byte[] data)
    {
        return md5.TransformBlock(data, 0, data.Length, data, 0);
    }

    // Feed the last block and finalize; the result is then available via md5.Hash.
    public static void AppendFinal(this MD5 md5, byte[] data)
    {
        md5.TransformFinalBlock(data, 0, data.Length);
    }
}

Usage:

using (var md5 = MD5.Create())
{
    md5.Initialize();

    var abcBytes = Encoding.Unicode.GetBytes("abc");
    md5.Append(abcBytes);
    md5.AppendFinal(abcBytes);

    var h1 = md5.Hash;

    md5.Initialize(); // mandatory before reusing the instance
    var h2 = md5.ComputeHash(Encoding.Unicode.GetBytes("abcabc"));

    Console.WriteLine(Convert.ToBase64String(h1));
    Console.WriteLine(Convert.ToBase64String(h2));
}

h1 and h2 are the same. That's it.

davidkonrad
1

Most digest calculation implementations allow you to feed them the data in smaller blocks. However, you can't combine multiple MD5 digests in a way that makes the result equal to the MD5 of the entire input. MD5 applies some padding and mixes the number of processed bytes into the final stage, which makes the original engine state unrecoverable from the final digest value.

x4u
  • So the following is a great example of how not to implement combining multiple MD5s? That user is simply concatenating multiple individual hashes for individual blocks of a large file. http://www.postgresql-archive.org/md5-large-object-id-tp5866710p5869128.html – Thorsten Schöning May 11 '17 at 07:42
  • @Thorsten: It can be appropriate to concatenate hash sums of fixed-size blocks and then hash the concatenated string again to get a single hash value. The resulting hash sum is just not the same as the one you would get if you had hashed the whole file. This means the concatenation is useless if you need to compare it with a hash that wasn't calculated this way, but if you define your own protocol you can settle on a certain block size and always calculate your hashes that way. The quality of the hash is no worse than that of the original hash function. The eDonkey p2p file-sharing network used hashes like this. – x4u May 11 '17 at 09:30
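
A small Python sketch of the block-hash scheme x4u describes, under the assumptions in that comment: pick a fixed block size, MD5 each block, then MD5 the concatenated per-block digests. The block size and function name here are arbitrary choices for illustration; the result is stable as long as both sides use the same block size, but it is deliberately not the plain MD5 of the whole file.

import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # any fixed size works, as long as both sides agree

def block_hash_list(path):
    outer = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            # Feed each block's digest (not the block's data) into the outer hash.
            outer.update(hashlib.md5(block).digest())
    return outer.hexdigest()

# Example: block_hash_list("bigFile.txt")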