I am experiencing unexpected behavior with file-reading and hashing (in Python 3.7).
I have a file that simply has the text "helloworld" in it, without a newline at the end:
>>hexdump -C input.txt
00000000  68 65 6c 6c 6f 77 6f 72  6c 64 0a                 |helloworld.|
0000000b
I run the following Python script:
import hashlib
def hashit(inp):
    # Hash the UTF-8 encoding of a str and return the hex digest.
    return hashlib.md5(inp.encode('utf-8')).hexdigest()
from_var = 'helloworld'
# Read the whole file in text mode.
with open('input.txt', 'r') as fo:
    from_file = fo.read()
print(f' from_file : { repr(from_file) }')
print(f' from_var : { repr(from_var) }')
print(f' from_file hash : { hashit(from_file) }')
print(f' from_var hash : { hashit(from_var) }')
I get the following output:
from_file : 'helloworld\n'
from_var : 'helloworld'
from_file hash : d73b04b0e696b0945283defa3eee4538
from_var hash : fc5e038d38a57032085441e7fe7010b0
The first thing I notice is the newline at the end when I read the file. Where does this come from?
Given the trailing newline, it is not surprising that the hashes are different for the two strings.
To check, I then ran the md5sum utility directly on the file:
>>md5sum input.txt
d73b04b0e696b0945283defa3eee4538 input.txt
This I don't get at all. The hash that md5sum reports in the shell is the same as the MD5 hash of the string with the trailing newline, even though there is no newline in the file.
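As a further check, here is a minimal sketch of how the raw bytes could be inspected and hashed in binary mode, so that nothing text mode does (decoding, newline translation) can influence the result. The helper name `raw_bytes_and_md5` is hypothetical, and the demo does not touch input.txt: it writes a temporary file from the hex bytes shown in the hexdump output above.

```python
import hashlib
import os
import tempfile

def raw_bytes_and_md5(path):
    """Return (raw bytes, MD5 hex digest) of the file at `path`.

    Hypothetical helper for illustration only.
    """
    # 'rb' reads the file byte for byte: no decoding, no newline translation.
    with open(path, 'rb') as fo:
        data = fo.read()
    return data, hashlib.md5(data).hexdigest()

# Assumption: the demo file is rebuilt from the hex bytes in the hexdump
# output (68 65 6c 6c 6f 77 6f 72 6c 64 0a), not read from input.txt.
demo_bytes = bytes.fromhex('68656c6c6f776f726c640a')
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(demo_bytes)
    demo_path = tmp.name

try:
    raw, digest = raw_bytes_and_md5(demo_path)
    print(f'raw bytes : {raw!r}')
    print(f'length    : {len(raw)}')
    print(f'md5       : {digest}')
finally:
    os.remove(demo_path)
```

Calling `raw_bytes_and_md5('input.txt')` on the real file would show exactly what is stored on disk. The length can be compared against the final offset in the hexdump (0000000b, i.e. 11 bytes) and the digest against the md5sum output.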
So my questions are:
- Why does .read() append a newline to the end of the file?
- Why does the md5sum from the command line correspond to the string **with** the trailing newline, even though the file has no newline?