MD5 Encoding HTML Giving 2 Different Results

Question

Can someone explain why this is happening? If I scrape HTML from a site using the requests module and compute its MD5 checksum with hashlib, I get one value. If I then save the HTML to a file, read it back, and compute the MD5 checksum again, I get a different value.

import requests
import hashlib

resp = requests.post("http://casesearch.courts.state.md.us/", timeout=120)
html = resp.text
print("CheckSum 1: " + hashlib.md5(html.encode('utf-8')).hexdigest())

f = open("test.html", "w+")
f.write(html)
f.close()

with open('test.html', "r", encoding='utf-8') as f:
    html2 = f.read()
print("CheckSum 2: " + hashlib.md5(html2.encode('utf-8')).hexdigest())

The results look like:

CheckSum 1: e0b253903327c7f68a752c6922d8b47a
CheckSum 2: 3aaf94e0df9f1298d61830d99549ddb0

The content must be different. Are the lengths the same? What happens if you compare html and html2? — DisappointedByUnaccountableMod, Mar 16 '19 at 16:22
Try: `print(type(html), len(html), type(html2), len(html2))`. — pts, Mar 16 '19 at 16:24
You may want to add `, encoding='utf-8'` to the 1st `open(...)` call as well. — pts, Mar 16 '19 at 16:25
@snakecharmerb got it. If I remove the \r and \n they end up being the same. Still not sure why though. When it saves it to a file does it change how newlines are represented? — MatthewExpungement, Mar 16 '19 at 16:38
Use `rb` instead of `r` and `wb` instead of `w` in open to avoid newline conversion. — pts, Mar 16 '19 at 16:40

score 1 · Accepted Answer · answered Mar 16 '19 at 16:43

When reading from a file in text mode, Python may convert newline characters depending on the value of the `newline` argument passed to `open`. From the documentation:

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.

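Because of this translation, html2 can end up with different line endings than html (for example '\n' where the server sent '\r\n'). Any change in the characters changes the bytes being hashed, so the MD5 values differ.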
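A minimal sketch of the effect, assuming the page uses '\r\n' line endings as the comments suggest. The `html` string here is a made-up stand-in for `resp.text`, and `md5_hex` is a hypothetical helper, not part of hashlib. It round-trips the same text through a file twice: once in default text mode, and once with `newline=""`, which turns translation off for both writing and reading.

```python
import hashlib
import os
import tempfile


def md5_hex(text):
    """Hypothetical helper: MD5 hex digest of a str, encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Stand-in for resp.text; assumes the server sends CRLF line endings.
html = "<html>\r\n<body>hello</body>\r\n</html>\r\n"

path = os.path.join(tempfile.mkdtemp(), "test.html")

# Default text mode: line endings may be translated on write and on read.
with open(path, "w", encoding="utf-8") as f:
    f.write(html)
with open(path, "r", encoding="utf-8") as f:
    translated = f.read()

# newline="" disables newline translation in both directions.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(html)
with open(path, "r", encoding="utf-8", newline="") as f:
    untranslated = f.read()

print("default text mode matches:", md5_hex(html) == md5_hex(translated))
print('newline="" matches:       ', md5_hex(html) == md5_hex(untranslated))
```

Opening the file with `wb` and `rb` and working with bytes, as suggested in the comments, avoids the translation as well. With requests that would mean hashing and writing `resp.content` (the raw bytes) instead of `resp.text`.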
