Python Md5 Hashing, trailing newline when reading file

Question

I am experiencing unexpected behavior with file-reading and hashing (in Python 3.7).

I have a file that simply has the text "helloworld" in it, without a newline at the end:

>>hexdump -C input.txt
00000000  68 65 6c 6c 6f 77 6f 72  6c 64 0a                 |helloworld.|
0000000b

I run the following Python script:

import hashlib

def hashit(inp):
    # Encode the text to bytes before hashing; md5 operates on bytes, not str.
    return hashlib.md5(inp.encode('utf-8')).hexdigest()

from_var = 'helloworld'

with open('input.txt', 'r') as fo:
    from_file = fo.read()

print(f' from_file      : { repr(from_file) }')
print(f' from_var       : { repr(from_var) }')

print(f' from_file hash : { hashit(from_file) }')
print(f' from_var  hash : { hashit(from_var) }')

I get the following output:

from_file      : 'helloworld\n'
from_var       : 'helloworld'
from_file hash : d73b04b0e696b0945283defa3eee4538
from_var  hash : fc5e038d38a57032085441e7fe7010b0

The first thing I notice is the newline at the end when I read the file. Where does this come from?

Given the trailing newline, it is not surprising that the hashes are different for the two strings.

To check, I then ran the md5sum utility directly on the file:

>>md5sum input.txt 
d73b04b0e696b0945283defa3eee4538  input.txt

This I don't get at all. The md5sum from the shell is the same as the md5sum of the string with the trailing newline - even though there is no newline in the file.

So my questions are:

Why does .read() append a newline to the end of the file?
Why does the md5sum from the command line correspond to the string **with** the trailing newline, even though the file has no newline?

How did you make sure that your file doesn't contain a newline at the end? Try `open('input.txt', 'rb').read()`; if you see the newline there, then it really is in the file. — mehdix, Nov 28 '18 at 09:39
Sorry, my bad. I missed the md5sum part in your original post. — mehdix, Nov 28 '18 at 10:24
`68 65 6c 6c 6f 77 6f 72 6c 64 0a` - the `0a` is the newline, since the file is 11 bytes and not 10. — Burhan Khalid, Nov 28 '18 at 10:24
As Burhan suggested, `0a` is the newline. You can see for yourself: `import codecs; codecs.decode('0a', 'hex')` gives `b'\n'` — mehdix, Nov 28 '18 at 10:35
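
The check suggested in the comments can be sketched as below. This is only an illustration: it writes the 11 bytes from the hexdump to a throwaway temporary file (an assumption, so that it runs without the original `input.txt`), reads them back in binary mode, and hashes them with and without the final `0a`. The helper name `md5_of_file` is made up for this sketch. The expected hashes in the comments are the two values from the question's output.

```python
import hashlib
import os
import tempfile


def md5_of_file(path):
    """Hash the file's raw bytes, the same input md5sum works on."""
    with open(path, 'rb') as fo:
        return hashlib.md5(fo.read()).hexdigest()


# Recreate the 11 bytes shown in the hexdump (assumed stand-in for input.txt).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'helloworld\n')
    path = tmp.name

try:
    # Binary mode returns the bytes exactly as stored, with no newline handling.
    with open(path, 'rb') as fo:
        raw = fo.read()

    print(repr(raw))            # the trailing b'\n' is visible here
    print(len(raw))             # 11 bytes, matching the 0000000b offset
    print(raw.endswith(b'\n'))  # True: the newline is stored in the file

    file_hash = md5_of_file(path)
    stripped_hash = hashlib.md5(raw.rstrip(b'\n')).hexdigest()
    print(file_hash)      # same as md5sum: d73b04b0e696b0945283defa3eee4538
    print(stripped_hash)  # same as from_var: fc5e038d38a57032085441e7fe7010b0
finally:
    os.remove(path)
```

If this matches on the real file, then `.read()` is not appending anything: the newline was most likely written by the editor or command that created the file, since many of them add a final newline by default. Stripping it before hashing (for example with `from_file.rstrip('\n')`) would make the two hashes agree.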
