I am trying to compare a string (in memory) to the contents of a file to see if they are the same. Boring details on motivation are below the question if anyone cares.
My confusion is that when I hash file contents, I get a different result than when I hash the string.
library(readr)
library(digest)
# write the string to the file
the_string <- "here is some stuff"
the_file <- "fake.txt"
readr::write_lines(the_string, the_file)
# both of these functions (predictably) give the same hash
tools::md5sum(the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398"
digest(file = the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398"
# now read it back to a string and get something different
back_to_a_string <- readr::read_file(the_file)
# "here is some stuff\n"
digest(back_to_a_string)
# "03ed1c8a2b997277100399bef6f88939"
# add a newline because that's what write_lines did
orig_with_newline <- paste0(the_string, "\n")
# "here is some stuff\n"
digest(orig_with_newline)
# "03ed1c8a2b997277100399bef6f88939"
What I want to do is just digest(orig_with_newline) == digest(file = the_file) to see if they're the same (they are), but that returns FALSE because, as shown, the hashes are different.
Obviously I could either read the file back to a string with read_file or write the string to a temp file, but both of those seem a bit silly and hacky. I guess both of those are actually fine solutions. I really just want to understand why this is happening so that I can better understand how the hashing works.
Boring details on motivation
The situation is that I have a function that will write a string to a file, but if the file already exists then it will error unless the user has explicitly passed .overwrite = TRUE. However, if the file exists, I would like to check whether the string about to be written to the file is in fact the same thing that's already in the file. If this is the case, then I will skip the error (and the write). This code could be called in a loop, and it would be obnoxious for the user to continually see an error saying they are about to overwrite a file with the same thing that's already in it.
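For concreteness, here is a minimal sketch of the behavior described above. It uses the read_file comparison mentioned earlier rather than hashing. The function name write_string_safely and its TRUE/FALSE return values are hypothetical, and it assumes a single string and that write_lines appends exactly one "\n" to it.

```r
library(readr)

# Hypothetical helper (name and return values are illustrative only).
# Returns TRUE invisibly if the file was written, FALSE if the write was skipped.
write_string_safely <- function(the_string, the_file, .overwrite = FALSE) {
  # Assumption: write_lines() adds one trailing newline to a length-1 string
  new_contents <- paste0(the_string, "\n")

  if (file.exists(the_file)) {
    old_contents <- readr::read_file(the_file)

    # Same contents already on disk: skip both the error and the write
    if (identical(old_contents, new_contents)) {
      return(invisible(FALSE))
    }

    # Different contents: only proceed if the user explicitly opted in
    if (!.overwrite) {
      stop("File already exists with different contents: ", the_file,
           call. = FALSE)
    }
  }

  readr::write_lines(the_string, the_file)
  invisible(TRUE)
}
```

Calling it twice in a row with the same string and path would write on the first call and silently do nothing on the second, which is the loop behavior I'm after.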