I am trying to validate two files downloaded from a server. The first contains data and the second file contains the MD5 hash checksum.

I created a function that returns a hexdigest from the data file like so:

import hashlib

def md5(fileName):
    """Compute the MD5 hex digest of the specified file."""
    try:
        fileHandle = open(fileName, "rb")
    except IOError:
        print("Unable to open the file in read mode:", fileName)
        return None
    m5Hash = hashlib.md5()
    # The with block closes the file even if reading raises an error.
    with fileHandle:
        while True:
            # Read in 8 KiB chunks so large files are never fully in memory.
            data = fileHandle.read(8192)
            if not data:
                break
            m5Hash.update(data)
    return m5Hash.hexdigest()

I compare the files using the following:

file = "/Volumes/Mac/dataFile.tbz"
fileHash = md5(file)

hashFile = "/Volumes/Mac/hashFile.tbz.md5"
fileHandle = open(hashFile, "rb")
fileHandleData = fileHandle.read()

if fileHash == fileHandleData:
    print ("Good")
else:
    print ("Bad")

The file comparison fails so I printed out both fileHash and fileHandleData and I get the following:

[0] b'MD5 (hashFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c\n'
[0] b60d684ab4a2570253961c2c2ad7b14c

From the output above, the hash values are identical. Why does the hash comparison fail? I am new to Python and am using Python 3.2. Any suggestions?

Thanks.

David
  • You're not showing us your function, nor how you're printing the variables. It's obvious the values you show are different, but not what type (one is the repr() of a byte string, the other is hex data). You may want to take a look at http://cfv.sourceforge.net/ – Yann Vernier May 02 '11 at 05:58

4 Answers

You are comparing a hash value to the entire contents of the hash file. You need to get rid of the MD5 (hashFile.tbz) = part as well as the trailing newline, so try:

if fileHash == fileHandleData.decode().rsplit(' ', 1)[-1].rstrip():
    print ("Good")
else:
    print ("Bad")

Keep in mind that in Python 3 a file opened in "rb" mode returns bytes. The bytes versions of rsplit() and rstrip() only accept bytes arguments, and a bytes object never compares equal to a (Unicode) string. Hence, as Fred Nurk correctly added, you need to decode fileHandleData (as done here) or encode fileHash so that both sides of the comparison have the same type.
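A minimal sketch of that parsing step, run against the sample line from the question as a hard-coded bytes literal. The helper name extract_hash is made up for illustration, and the sketch assumes the checksum file is plain ASCII with the hash as its last space-separated token:

```python
def extract_hash(raw):
    """Return the last space-separated token of a checksum line as str."""
    # Decode first so that the separator and the data are both str.
    text = raw.decode("ascii")
    return text.rsplit(" ", 1)[-1].rstrip()

sample = b"MD5 (hashFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c\n"
file_hash = "b60d684ab4a2570253961c2c2ad7b14c"

print(extract_hash(sample) == file_hash)  # True
```

Decoding once up front keeps every later string operation in str, which avoids mixing the two types further down.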

Michael Foukarakis

The comparison fails for the same reason this is false:

a = "data"
b = b"blah (blah) - data"
print(a == b)

The format of that .md5 file is strange, but if it is always in that format, a simple way to test would be:

if fileHandleData.rstrip().endswith(fileHash.encode()):

Because you have fileHash as a (Unicode) string, you have to encode it to bytes to compare. You may want to specify an encoding rather than use the current default string encoding.
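As a small illustration of the type mismatch and the explicit encoding, here is a sketch using the values printed in the question. A hex digest only contains the characters 0-9 and a-f, so "ascii" is assumed to be a safe explicit choice; the variable names are illustrative:

```python
file_hash = "b60d684ab4a2570253961c2c2ad7b14c"
raw = b"MD5 (hashFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c\n"

# A str never compares equal to bytes in Python 3,
# even when both hold the same characters.
same_type_needed = file_hash != file_hash.encode("ascii")

# Encoding explicitly makes both operands bytes.
matches = raw.rstrip().endswith(file_hash.encode("ascii"))

print(same_type_needed, matches)  # True True
```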

If that exact format is always expected, it would be more robust to use a regex to extract the hash value and possibly check the filename.
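One possible shape for such a regex, as a sketch only. The pattern and the helper name parse_md5_line are assumptions based on the single sample line shown in the question, not a specification of the file format:

```python
import re

# Hypothetical pattern for lines like b"MD5 (name) = <32 hex digits>\n".
MD5_LINE = re.compile(
    br"^MD5 \((?P<name>.+)\) = (?P<digest>[0-9a-fA-F]{32})\s*$"
)

def parse_md5_line(raw):
    """Return (filename, hexdigest) as str, or None if raw does not match."""
    match = MD5_LINE.match(raw)
    if match is None:
        return None
    name = match.group("name").decode("utf-8")
    digest = match.group("digest").decode("ascii").lower()
    return name, digest

print(parse_md5_line(b"MD5 (hashFile.tbz) = b60d684ab4a2570253961c2c2ad7b14c\n"))
```

Returning None for a non-matching line lets the caller distinguish a malformed checksum file from a genuine hash mismatch, and the returned filename can be checked against the name of the downloaded file.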

Or, more flexibly, you could test substring presence:

if fileHash.encode() in fileHandleData:

Fred Nurk
  • Thank you for the help. I used your advice and made the following modifications `if fileHash in fileHandleData.decode("utf-8"):`. – David May 02 '11 at 11:00
0

Try fileHash.strip("\n"), then compare the two. That should fix the problem.

0

The hash values are identical, but the strings are not. You need to get the hex value of the digest, and you need to parse the hash out of the file. Once you have done those you can compare them for equality.
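Put together, those two steps might look like the following sketch. The function names file_md5, expected_md5 and verify are invented for illustration, and the parsing assumes the stored hash is the last whitespace-separated token of an ASCII checksum file:

```python
import hashlib

def file_md5(path, chunk_size=8192):
    """Return the MD5 hex digest (a str) of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def expected_md5(md5_path):
    """Return the hash stored in md5_path, assumed to be the last token."""
    # Text mode yields str, the same type that hexdigest() returns.
    with open(md5_path, "r", encoding="ascii") as handle:
        return handle.read().split()[-1].lower()

def verify(data_path, md5_path):
    """Return True if the computed and the stored digests match."""
    return file_md5(data_path) == expected_md5(md5_path)
```

With the paths from the question, the check would be verify("/Volumes/Mac/dataFile.tbz", "/Volumes/Mac/hashFile.tbz.md5"). Opening the checksum file in text mode means both values are str, so the equality test compares like with like.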

Ignacio Vazquez-Abrams