An MD5 hash should return the same value irrespective of who performs the hash calculation and where.

Yet using three different methods on the same file, we see three different answers (!?).

Here's the file.

The MD5 hash according to Amazon Web Services is:

library(dplyr)
"https://collidr-api.s3-ap-southeast-2.amazonaws.com/pfd.RDS" %>% curlGetHeaders %>% .[6] %>% trimws %>% 
  strsplit(., "ETag: ") %>% .[[1]] %>% .[2] %>% 
  { substr(., 2, nchar(.)) } %>% { substr(., 1, nchar(.) - 1)}
# "a921f713fbd730a51814fb6602048c16"

The MD5 hash using the digest library is

library(digest)
digest("Downloads/pfd.RDS", algo=c("md5"))
# "2b049aba0269e46d35780c3e7d29a916"

And the MD5 hash using the openssl library is

library(openssl)
md5("Downloads/pfd.RDS")
# "8ceabf9bdd146ed12ba89533cd593d12"

I can't explain this. I expected all three values to be the same since they're all applying the same algorithm (MD5) to the same file, yet all 3 are different.

Question

Why aren't the hash values the same irrespective of the method used to generate the MD5 hash of the file, and most importantly, how do I calculate the hash in R such that it matches the MD5 hash provided by AWS (i.e. a921f713fbd730a51814fb6602048c16)?

UPDATE

In the macOS terminal, md5 Downloads/pfd.RDS returns a921f713fbd730a51814fb6602048c16 (consistent with the AWS value). It's still not clear why the digest::digest() and openssl::md5() values differ.

stevec

1 Answer

If you want to hash the contents of the file at that path, you need to tell each of the functions that. Try

digest("Downloads/pfd.RDS", file=TRUE, algo="md5")

and

md5(file("Downloads/pfd.RDS", open="rb"))

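Otherwise you are hashing the path string itself, not the file it points to. The two original results also differ from each other because digest() serializes the R object before hashing by default (serialize=TRUE), whereas md5() hashes the characters of the string directly.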
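To see that the original calls never touch the file, here is a minimal sketch that hashes only the string "Downloads/pfd.RDS". It assumes the digest and openssl packages are installed; the variable names are illustrative. No file needs to exist at that path for it to run.

```r
library(digest)
library(openssl)

# Only a character string; nothing is read from disk in this sketch.
path_string <- "Downloads/pfd.RDS"

# digest() serializes the R object first (serialize = TRUE is its default),
# so this is the MD5 of the serialized character vector.
serialized_hash <- digest(path_string, algo = "md5")

# With serialize = FALSE, digest() hashes the characters of the string.
raw_string_hash <- digest(path_string, algo = "md5", serialize = FALSE)

# openssl::md5() on a character vector also hashes the characters.
openssl_hash <- as.character(md5(path_string))

print(serialized_hash == openssl_hash)  # FALSE: different bytes were hashed
print(raw_string_hash == openssl_hash)  # TRUE: same bytes, same MD5
```

Once both functions are given the same bytes they agree, which suggests the mismatch comes from what is being hashed rather than from the MD5 implementations.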

The two corrected calls return the same hash in the simple case of

cat("hello", file="hello.txt")
digest("hello.txt", file=TRUE, algo="md5")
# [1] "5d41402abc4b2a76b9719d911017c592"
md5(file("hello.txt", open="rb"))
# md5 5d:41:40:2a:bc:4b:2a:76:b9:71:9d:91:10:17:c5:92 
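As a cross-check that needs no extra packages, base R's tools::md5sum() hashes a file's bytes on disk. The sketch below repeats the "hello" example with a temporary file; the variable names are illustrative, and this is an alternative check rather than the approach used above.

```r
# Write the five bytes "hello" (no trailing newline) to a temporary file.
path <- tempfile(fileext = ".txt")
cat("hello", file = path)

# tools::md5sum() returns a character vector named by file path;
# unname() leaves just the hex digest.
file_md5 <- unname(tools::md5sum(path))
print(file_md5)
# [1] "5d41402abc4b2a76b9719d911017c592"
```

Running tools::md5sum("Downloads/pfd.RDS") on the downloaded file would be expected to give the same value as the terminal md5 command, since both hash the raw file contents.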
MrFlick
  • `digest("Downloads/pfd.RDS", file=TRUE, algo="md5")` returns `# [1] "a921f713fbd730a51814fb6602048c16"` (expected), but `md5(file("Downloads/pfd.RDS"))` returns `md5 3a:b7:0d:89:31:89:8b:dd:c3:8b:65:04:6c:ae:6a:25` – stevec Jan 21 '20 at 04:06
  • You are running both these commands in the exact same R session and are getting different responses? Does the example I included return the same value for both functions? – MrFlick Jan 21 '20 at 04:17
  • Yes, both commands are in the exact same R session (one right after the other). Your example returns the same value for both functions. – stevec Jan 21 '20 at 04:24
  • Use `file("Downloads/pfd.RDS", open = "rb")` so the connection is opened for reading in binary mode. – Ritchie Sacramento Jan 21 '20 at 04:26
  • @H1 awesome knowledge. How did you figure that out? (I had come to a dead end) – stevec Jan 21 '20 at 04:29
  • @user5783745 A different hash meant the file had undergone a change and the file being opened in text mode rather than binary mode was a likely candidate. – Ritchie Sacramento Jan 21 '20 at 04:49