
Let me take you on a journey...

I'm trying to download and verify Apache Spark (http://www.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz) via MD5 on a fresh Debian (Jessie) machine.

The md5sum program was already installed on that machine without me needing to do anything.

So I download the MD5 checksum (http://www.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz.md5) to the same directory as the downloaded Spark archive, and then execute:

md5sum -c spark-1.6.0-bin-hadoop2.6.tgz.md5

This fails with:

md5sum: spark-1.6.0-bin-hadoop2.6.tgz.md5: no properly formatted MD5 checksum lines found

And so I check the contents via cat spark-1.6.0-bin-hadoop2.6.tgz.md5:

spark-1.6.0-bin-hadoop2.6.tgz: 62 4B 16 1F 67 70 A6 E0  E0 0E 57 16 AF D0 EA 0B

That's the whole file. Looks decent to me. Maybe the Spark download was actually bad? Before assuming that, I'll first compute the MD5 of the download myself via md5sum spark-1.6.0-bin-hadoop2.6.tgz:

624b161f6770a6e0e00e5716afd0ea0b  spark-1.6.0-bin-hadoop2.6.tgz

Hmm, that's a completely different format, but if you look closely you'll notice that the hex digits are actually the same (just lowercase and without spaces). So the download is fine; it looks like the md5sum that comes with Debian expects a different file format from the one used in the .md5 file.
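As a quick sanity check of that observation, here is a minimal shell sketch that normalizes the digest from the .md5 file (strip whitespace, lowercase the hex digits) and compares it with what md5sum printed. The two digest strings are copied from the outputs above; the variable names are only for illustration.

```shell
# Digest as published in the .md5 file (copied from the cat output above).
published='62 4B 16 1F 67 70 A6 E0  E0 0E 57 16 AF D0 EA 0B'
# Digest as printed by md5sum (copied from the md5sum output above).
computed='624b161f6770a6e0e00e5716afd0ea0b'

# Strip all whitespace and lowercase the hex digits so both use one format.
normalized=$(printf '%s' "$published" | tr -d '[:space:]' | tr 'A-F' 'a-f')

if [ "$normalized" = "$computed" ]; then
    echo "digests match"
else
    echo "digests differ"
fi
```

This only shows that the two strings encode the same digest; it does not make md5sum -c accept the file.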

Maybe there's another way I can run this command? Let's try md5sum --help:

Usage: md5sum [OPTION]... [FILE]...
Print or check MD5 (128-bit) checksums.
With no FILE, or when FILE is -, read standard input.

  -b, --binary         read in binary mode
  -c, --check          read MD5 sums from the FILEs and check them
      --tag            create a BSD-style checksum
  -t, --text           read in text mode (default)

The following four options are useful only when verifying checksums:
      --quiet          don't print OK for each successfully verified file
      --status         don't output anything, status code shows success
      --strict         exit non-zero for improperly formatted checksum lines
  -w, --warn           warn about improperly formatted checksum lines

      --help     display this help and exit
      --version  output version information and exit

The sums are computed as described in RFC 1321.  When checking, the input
should be a former output of this program.  The default mode is to print
a line with checksum, a character indicating input mode ('*' for binary,
space for text), and name for each FILE.

GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report md5sum translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/md5sum>
or available locally via: info '(coreutils) md5sum invocation'

Okay, --tag seems to change the format. Let's try md5sum --tag spark-1.6.0-bin-hadoop2.6.tgz:

MD5 (spark-1.6.0-bin-hadoop2.6.tgz) = 624b161f6770a6e0e00e5716afd0ea0b

Indeed, that is a different format, but still not the right one. So I turn to the instructions on the Apache Download Mirrors page and find the following text:

Alternatively, you can verify the MD5 hash on the file. A unix program called md5 or md5sum is included in many unix distributions. It is also available as part of GNU Textutils...

So I follow that link and find that Textutils was merged into Coreutils in 2003, so I actually want the md5sum from Coreutils. However, you can see at the bottom of the md5sum --help dump that mine is already from Coreutils.

That might mean that my Coreutils are out of date. So I run apt-get update && apt-get upgrade coreutils, only to find that:

Calculating upgrade... coreutils is already the newest version.

That's a dead end then... but wait a moment, they said "md5 or md5sum"! Let's check out that lead.

The md5 program isn't installed, so I'll try apt-get install md5:

E: Unable to locate package md5

And now I'm lost, so I turned to Google and then Stack Overflow for help. Now here I am.

So what's with the two different MD5 file formats and how can I deal with this issue (and finally verify my Apache Spark)?

Bilal Akil

1 Answer


I believe gpg --print-md md5 spark-1.6.0-bin-hadoop2.6.tgz should match the .md5 file's content.

There were problems with the format of the md5/sha files because the script that builds the Spark release uses gpg --print-md md5 to create the checksum files. See: https://issues.apache.org/jira/browse/SPARK-5308
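If you would rather keep using md5sum -c, one option is to rewrite the gpg-style line into the "hash  name" layout that md5sum expects. The following is only a minimal sketch under some assumptions: check_gpg_md5 is a hypothetical helper name, the .md5 file has the "name: XX XX ..." layout shown in the question, the file name contains no colon, and the archive sits in the current directory.

```shell
# check_gpg_md5 FILE.md5
# Hypothetical helper: rewrites a "name: XX XX ..." digest file (the layout
# shown in the question) into md5sum's "hash  name" layout and checks it.
check_gpg_md5() {
    md5file=$1
    # Everything before the first ':' on the first line is the file name.
    name=$(sed -n '1s/:.*//p' "$md5file")
    # The rest is the digest: drop whitespace and lowercase the hex digits.
    hash=$(sed 's/^[^:]*://' "$md5file" | tr -d '[:space:]' | tr 'A-F' 'a-f')
    # Feed the rewritten line to md5sum -c on standard input.
    printf '%s  %s\n' "$hash" "$name" | md5sum -c -
}
```

Calling check_gpg_md5 spark-1.6.0-bin-hadoop2.6.tgz.md5 from the download directory should then print md5sum's usual OK or FAILED line, and the exit status follows md5sum -c.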

delephin
  • Ahh, fascinating.. why do they tell us to use coreutils' `md5sum` then (rhetorical question)? An actual question though: what's the right way to emulate the `-c` in `md5sum -c` using this `gpg` command? The "naive" way would be to simply diff the files and hope nothing's different. Is there a better way? Also, from the comments on that issue it's apparent that it was changed to `md5`/`md5sum` (with a merged pull request, which explains why they tell us to use one of them), but why do we still have the `gpg`-formatted data coming out then? – Bilal Akil Feb 07 '16 at 08:11
  • The bug was reported for some Maven incompatibility, so they only modified how those signature files were created. `gpg` is still used to create the final package's signature. – delephin Feb 07 '16 at 08:33
  • Roger, so how would you check the downloaded file then? – Bilal Akil Feb 07 '16 at 08:47
  • You'll have to go with something like `gpg --print-md MD5 spark-1.6.0-bin-hadoop2.6.tgz | diff - spark-1.6.0-bin-hadoop2.6.tgz.md5` because I don't think there's an automatic way to do this. You can always download the `spark-1.6.0-bin-hadoop2.6.tgz.asc` file and do something like `gpg --verify spark-1.6.0-bin-hadoop2.6.tgz.asc spark-1.6.0-bin-hadoop2.6.tgz` after getting the public key. – delephin Feb 07 '16 at 09:20