
Can someone explain why this happens? If I fetch the HTML from a site with the requests module and compute its MD5 checksum with hashlib, I get one value. If I then save the HTML to a file, read it back, and compute the MD5 checksum again, I get a different value.

import requests
import hashlib

resp = requests.post("http://casesearch.courts.state.md.us/", timeout=120)
html = resp.text
print("CheckSum 1: " + hashlib.md5(html.encode('utf-8')).hexdigest())

f = open("test.html", "w+")
f.write(html)
f.close()

with open('test.html', "r", encoding='utf-8') as f:
    html2 = f.read()
print("CheckSum 2: " + hashlib.md5(html2.encode('utf-8')).hexdigest())

The results look like:

CheckSum 1: e0b253903327c7f68a752c6922d8b47a
CheckSum 2: 3aaf94e0df9f1298d61830d99549ddb0
martineau

1 Answer


When a file is read in text mode, Python may translate newline characters, depending on the value of the newline argument passed to open. The documentation for open says:

When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.

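If the downloaded HTML contains '\r\n' or '\r' line endings, the text read back from the file is no longer identical to the original string, so the two MD5 hashes differ.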
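To illustrate, here is a minimal sketch of the round trip that avoids the network request. The html string is a made-up stand-in for resp.text, and md5_hex is a helper introduced only for this example. Passing newline='' to open on both the write and the read should disable newline translation, so the text comes back unchanged:

```python
import hashlib
import os
import tempfile


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of text encoded as UTF-8 (illustrative helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Assumption: a stand-in for resp.text that uses '\r\n' line endings.
html = "<html>\r\n<body>hello</body>\r\n</html>\r\n"

path = os.path.join(tempfile.mkdtemp(), "test.html")

# newline="" on write: line endings are written exactly as they appear in the string.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(html)

# Default read (newline=None): '\r\n' and '\r' are translated to '\n'.
with open(path, "r", encoding="utf-8") as f:
    translated = f.read()

# newline="" on read: line endings are returned untranslated.
with open(path, "r", encoding="utf-8", newline="") as f:
    untranslated = f.read()

print("original:    ", md5_hex(html))
print("translated:  ", md5_hex(translated))
print("untranslated:", md5_hex(untranslated))
```

Another option that sidesteps both decoding and newline handling is to hash resp.content (the raw bytes) and write and read the file in binary mode ("wb" and "rb").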

snakecharmerb