
When calculating the md5 sum of large files, I see a single cpu core jump to 100% for however long it takes, leaving all other cores idle.

My rudimentary understanding of md5 is that the process is strictly sequential: each value depends on everything read before it, so there is nothing we can do to make it multi-threaded. Is this true?

Or is there a way to break the files into sections, calculate <something> over multiple parts using multi-cores, and then combine those <something> values into the final md5?

The library we're using to calculate the md5sum is http://libmd5-rfc.sourceforge.net/ but I'd switch to a different one if it was possible to break the md5sum across multiple cores so it completes faster.

(Note: changing to something other than md5 is not the question, nor can it be done because of the other closed systems to which this interfaces. Nor is this question about the safety of using md5.)

Stéphane
  • 4
    Did you Google for this? One of the first hits for "parallel implementation MD5" is: http://wwwcip.cs.fau.de/~spjsschl/md5.pdf, which seems to show that the short answer is "Yes, it can." – Jerry Coffin May 23 '12 at 19:19
  • +1 for your "note". Though the fact that you're aware of the issues implies that maybe you should consider doing something about them... – Ben May 23 '12 at 19:22
  • 4
    @JerryCoffin I think that article is a bit misleading. I gathered that they were parallelizing the multiple iterations of MD5 for password hashing, not parallelizing the MD5 algorithm itself. Their other optimization was to use one large 128-bit SSE register instead of 4 32-bit registers. – greg May 23 '12 at 19:32
  • 4
    Read that paper and basically they didn't do anything to multi-thread MD5 computation on a single input. Their multi-threading is only for computing multiple MD5 hashes at the same time, which would only be useful for the OP if he has more than one large file to hash. They summed it up in section 3.2: "Because there is a non-removable data-dependency between every step of a MD5 iteration, it is not possible to speed up the runtime of a single iteration" – billc.cn May 23 '12 at 19:35
  • pipelining is also parallelization. MD5 runs 64 rounds, and those rounds can be pipelined. That's what they've done. – jthill May 23 '12 at 20:00

1 Answer


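No, you cannot break it apart at the file level. MD5 maintains a running state as it processes the data: each 64-byte block is compressed together with the state left by the previous block, so a section of the file cannot be hashed until everything before it has been processed.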
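A minimal sketch of that dependency, for illustration only. It uses Python's standard `hashlib` rather than the libmd5-rfc library from the question, and the helper names `md5_streaming` and `md5_per_chunk` are hypothetical. It shows that feeding the sections in order into one state reproduces the whole-file digest, while hashing the sections independently (what a naive multi-core split would compute) does not.

```python
import hashlib


def md5_streaming(chunks):
    """Feed chunks, in order, into a single MD5 state and return the hex digest."""
    h = hashlib.md5()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def md5_per_chunk(chunks):
    """Hash every chunk on its own, as separate worker threads would."""
    return [hashlib.md5(chunk).hexdigest() for chunk in chunks]


# 256 KiB of sample data in which every 64 KiB section is different.
data = b"".join(i.to_bytes(4, "big") for i in range(65536))
chunks = [data[i:i + 65536] for i in range(0, len(data), 65536)]

whole = hashlib.md5(data).hexdigest()

# Sequential updates on one state give the same digest as hashing in one go.
print(md5_streaming(chunks) == whole)

# Feeding the same sections in a different order gives a different digest,
# because each step starts from the state left by the previous one.
print(md5_streaming(chunks[::-1]) == whole)

# None of the independent per-section digests is the whole-file digest.
print(whole in md5_per_chunk(chunks))
```

The per-section digests are final 128-bit outputs with their own padding applied, so they do not carry anything that could be merged back into the digest of the whole file. As the comments on the question point out, the realistic way to use several cores is to hash several files at the same time, one state per file.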

pizza