0

The code below shows that three files on a UNC share hosted on another machine all have the same hash, while local files have different hashes. Why would this be? I suspect there is some UNC consideration that I don't know about.

Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> fn_a = '\\\\some.host.com\\Shares\\folder1\\file_a'
>>> fn_b = '\\\\some.host.com\\Shares\\folder1\\file_b'
>>> fn_c = '\\\\some.host.com\\Shares\\folder2\\file_c'
>>> fn_d = 'E:\\file_d'
>>> fn_e = 'E:\\file_e'
>>> fn_f = 'E:\\folder3\\file_f'
>>> f_a = open(fn_a, 'r')
>>> f_b = open(fn_b, 'r')
>>> f_c = open(fn_c, 'r')
>>> f_d = open(fn_d, 'r')
>>> f_e = open(fn_e, 'r')
>>> f_f = open(fn_f, 'r')
>>> hashlib.md5(f_a.read()).hexdigest()
'54637fdcade4b7fd7cabd45d51ab8311'
>>> hashlib.md5(f_b.read()).hexdigest()
'54637fdcade4b7fd7cabd45d51ab8311'
>>> hashlib.md5(f_c.read()).hexdigest()
'54637fdcade4b7fd7cabd45d51ab8311'
>>> hashlib.md5(f_d.read()).hexdigest()
'd2bf541b1a9d2fc1a985f65590476856'
>>> hashlib.md5(f_e.read()).hexdigest()
'e84be3c598a098f1af9f2a9d6f806ed5'
>>> hashlib.md5(f_f.read()).hexdigest()
'e11f04ed3534cc4784df3875defa0236'

EDIT: To further investigate the problem, I also tested using a file from another host. It appears that changing the host will change the result.

>>> fn_h = '\\\\host\\share\\file'
>>> f_h = open(fn_h, 'r')
>>> hashlib.md5(f_h.read()).hexdigest()
'f23ee2dbbb0040bf2586cfab29a03634'

...but then I tried a different file on the new host, and got a new result!

>>> fn_i = '\\\\host\\share\\different_file'
>>> f_i = open(fn_i, 'r')
>>> hashlib.md5(f_i.read()).hexdigest()
'a8ad771db7af8c96f635bcda8fdce961'

So, now I'm really confused. Could it have something to do with the fact that the original host is a \\host.com format and the new host is a \\host format?

jamylak
Shaun
  • In the hashlib test part you call `f1.read()` both times. You also did not reset the file pointer so each call will return 0 bytes so unsurprisingly the result is the same as `hashlib.md5(b'').hexdigest()` which returns the md5 digest value you have reported. – patthoyts Jun 13 '15 at 08:34
  • @patthoyts, thanks, I fixed those issues in the code to hopefully clarify the root problem that I am seeing. UNC paths to a remote machine still result in the same hash, while local files do not. – Shaun Jun 13 '15 at 16:26
  • Still need help on this question! – Shaun Jun 16 '15 at 15:04

3 Answers

2

I did some additional research based on the comments and answers everyone provided. I decided to test every combination of these two features of the code:

  1. Whether a raw string literal is used for the path name, i.e. whether:
    A. The file path string is raw, with single backslashes in the path, vs.
    B. The file path string is not raw, with double backslashes in the path

    (FYI for those who don't know, a raw string is one which is preceded by an "r", like this: r'This is a raw string')

  2. Whether the open function mode is r or rb.
    (FYI again for those who don't know, the b in rb mode means the file is read as binary.)

The results demonstrated:

  • The string literal style (raw vs. escaped backslashes) makes no difference to whether different files produce different hashes.
  • My error was not opening the files in binary mode. When using rb mode in open, I got a different hash for each file.
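
As a rough illustration of the binary-mode fix, here is a minimal sketch in Python 3 (the question used Python 2.7). The helper names md5_of_file and write_temp, the chunk size, and the sample bytes are assumptions for illustration, and temporary files stand in for the UNC paths. One plausible reason text mode hid the differences, not confirmed in this thread: in Python 2 on Windows, text mode translates line endings and stops reading at the first Ctrl-Z byte (0x1A), so binary files that begin with the same bytes can appear identical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in binary mode."""
    digest = hashlib.md5()
    # 'rb' returns the bytes exactly as stored: no newline translation
    # and no special handling of any byte value.
    with open(path, 'rb') as f:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def write_temp(data):
    """Hypothetical helper: write bytes to a temp file, return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    return path


if __name__ == '__main__':
    # Two made-up files that share a prefix and differ after a 0x1A byte.
    path_a = write_temp(b'HEADER\x1apayload one\r\n')
    path_b = write_temp(b'HEADER\x1apayload two\r\n')
    try:
        print(md5_of_file(path_a))
        print(md5_of_file(path_b))
    finally:
        os.remove(path_a)
        os.remove(path_b)
```

With real files, md5_of_file(r'\\some.host.com\Shares\folder1\file_a') would be called the same way.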

Yay! And thanks for the help.

Shaun
1

Use f1.seek(0) if you intend to read the file again; otherwise the file has already been read to the end, and calling read() again just returns an empty string.
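
A small sketch of that behaviour, written for Python 3. The io.BytesIO object and its contents are stand-ins (assumptions for illustration) for a real file opened with open(path, 'rb'); the variable names are made up.

```python
import hashlib
import io

# io.BytesIO behaves like a file opened in binary mode.
f = io.BytesIO(b'some file contents')

first = hashlib.md5(f.read()).hexdigest()    # hashes the full contents
second = hashlib.md5(f.read()).hexdigest()   # pointer is at EOF: hashes b''

f.seek(0)                                    # rewind to the start
third = hashlib.md5(f.read()).hexdigest()    # hashes the full contents again

print(first == third)                             # True
print(second == hashlib.md5(b'').hexdigest())     # True
```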

satoru
  • thanks, I fixed those issues in the code to hopefully clarify the root problem that I am seeing. UNC paths to a remote machine still result in the same hash, while local files do not. – Shaun Jun 13 '15 at 16:27
1

I can't reproduce your problem. I'm using Python 3.4 on Windows 7 here, with the following test script, which accesses files on a network hard disk:

import sys, hashlib
def main():
    fn0 = r'\\NAS\Public\Software\Backup\Test\Vagrantfile'
    fn1 = r'\\NAS\Public\Software\Backup\Test\z.xml'
    with open(fn0, 'rb') as f:
        h0 = hashlib.md5(f.read())
        print(h0.hexdigest())
    with open(fn1, 'rb') as f:
        h1 = hashlib.md5(f.read())
        print(h1.hexdigest())

if __name__ == '__main__':
    sys.exit(main())

Running this results in two different hash values (as expected):

c:\src\python>python hashtest.py
8af202dffb88739c2dbe188c12291e3d
2ff3db61ff37ca5ceac6a59fd7c1018b

If reading the file contents returns different data for the remote files then passing that data into md5 has to result in different hash values. You might want to print out the first 80 bytes of each file as a check that you are getting what you expect.
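
The suggested check could look something like this in Python 3. The helper name peek and the temporary file are assumptions for illustration; in practice the UNC paths from the question would be passed in instead.

```python
import os
import tempfile


def peek(path, n=80):
    """Return the first n bytes of a file, read in binary mode."""
    with open(path, 'rb') as f:
        return f.read(n)


if __name__ == '__main__':
    # A throwaway file stands in for a file on the remote share.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as f:
        f.write(b'x' * 200)
    try:
        data = peek(path)
        # Show both the raw bytes and how many were actually read.
        print(repr(data), len(data))
    finally:
        os.remove(path)
```

If two files that should differ print the same leading bytes and the same short length, the read itself is the problem rather than the hashing.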

patthoyts